# Transfer learning for text mining

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2019 Florian Leitner. All rights reserved.

During this course, we've worked with two datasets: the *20 Newsgroups* (day 1) and the *Reuters-21578* (day 3) datasets.
In this last exercise, we will re-run the 20 Newsgroups classification with a neural network and try to beat [the best published results](https://pdfs.semanticscholar.org/f678/3875b62d1eecbb0160d1cc12e26b50e81612.pdf) from models *other* than neural networks.
(If you like, 20 Newsgroups in particular is the text mining equivalent of "MNIST" in computer vision.)

Of those two corpora, we focus on the first: the best (non-neural) baselines achieve around **90% accuracy on the 20 Newsgroups** corpus, using the official (roughly 60:40) split. On day one, we achieved around 85% accuracy on this set. So it will be interesting to see how simple it is to achieve better results with modern deep learning models.

(For Reuters-21578, the state of the art "before deep learning" was a 94% micro-averaged $F_1$ score using the official ModApte split, but only when selecting documents from the ten most frequent categories (200 or more documents each); using all 90 categories, not just the top 10, the micro-averaged $F_1$ score is 89%.
And, obviously, it is important that the evaluation follows a single multilabel classification setting, and is not split into 10 or 90 individual, binary classification problems... Most deep learning literature only focuses on the simpler 10-category subset of Reuters-21578, because otherwise there are too few examples to work with. So the number you should probably keep in mind for that set when reading a new deep learning paper is the 94% micro-averaged $F_1$ score.)
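
To make the "single multilabel setting" concrete, here is a minimal sketch of how a micro-averaged $F_1$ score can be computed by pooling the counts over all documents and labels. The function name `micro_f1` and the toy label sets are assumptions made up for illustration; they are not taken from the Reuters-21578 evaluation itself.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over documents that each carry a set of labels.

    gold, predicted: lists of label sets, one set per document.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # labels correctly assigned
        fp += len(p - g)  # labels assigned but not in the gold standard
        fn += len(g - p)  # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for this example, NOT real Reuters-21578 annotations).
gold = [{"earn"}, {"grain", "wheat"}, {"crude"}]
pred = [{"earn"}, {"grain"}, {"crude", "ship"}]
print(round(micro_f1(gold, pred), 3))
```

Because the counts are pooled before computing precision and recall, frequent categories dominate the score, which is one reason why results on the top-10 subset and on all 90 categories are not comparable.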

Our goal for this last tutorial will be to beat at least our initial results with an artificial neural network model, to better understand how powerful these very new machine learning techniques are (or whether they are more powerful at all).

## Installation and setup

For this practical, we will use *TensorFlow* to build a simple multi-class text classifier on top of a neural network from Google Research called [BERT](https://github.com/google-research/bert).

[Installing TensorFlow](https://www.tensorflow.org/install) and BERT is simple, but if you don't have a GPU, don't even bother:

```bash
conda install tensorflow-gpu bert-tensorflow
# or:
pip3 install tensorflow-gpu bert-tensorflow
```

Beyond that, we will be importing some of the more common library favorites for machine learning in Python: Pandas, Numpy, and SciKit-Learn.

In [70]:
import tensorflow as tf
import pandas as pd
import numpy as np

from bert import run_classifier
from bert import optimization
from bert import tokenization
from bert import modeling

from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import accuracy_score

BERT comes in many different models; we will use the smallest [English model, BERT-Base, uncased, as a zipped archive](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip). That file should be unzipped and placed in a folder called `bert_models` relative to this notebook:

In [2]:
BERT_VOCAB = 'bert_models/uncased_L-12_H-768_A-12/vocab.txt'
BERT_INIT_CHKPNT = 'bert_models/uncased_L-12_H-768_A-12/bert_model.ckpt'
BERT_CONFIG = 'bert_models/uncased_L-12_H-768_A-12/bert_config.json'

## Corpus setup

Next, we collect/load the corpus we are training on, both the training and test data (for evaluation).

In [5]:
train = fetch_20newsgroups(subset='train')  # 'train' is also the default
test = fetch_20newsgroups(subset='test')

As always in machine learning, review the data and get a feeling for it:

In [103]:
print("------------ TRAIN ------------")
print(train.data[0].strip())
print("\nLABEL:", train.target_names[train.target[0]],
      "=", train.target[0])

print("\n\n------------ TEST ------------")
print(test.data[-1].strip())
print("\nLABEL:", test.target_names[test.target[-1]],
      "=", test.target[-1])

------------ TRAIN ------------
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

LABEL: rec.autos = 7


------------ TEST ------------
From: adamsj@gtewd.mtv.gtegsc.com
Subject: Re: Homosexuality issues in Christianity
Reply-To: adamsj@gtewd.mtv.gtegsc.com
Organization: GTE Govt. Systems, Electronics Def. Div.
Lines: 18

In artic

<hr/>

### BERT corpus transformation

Next, we need to transform the training set into BERT input examples.

In [6]:
def create_examples(documents, labels, label_names):
    """Creates examples for the training and dev sets."""
    examples = []
    
    for i, text in enumerate(documents):
        if labels is not None and len(labels):
            l = label_names[labels[i]]
        else:
            l = None
            
        examples.append(
            run_classifier.InputExample(
                guid=i, 
                text_a=text,
                label=l,
            )
        )
        
    return examples

In [74]:
train_examples = create_examples(
    train.data,
    train.target,
    train.target_names,
)

len(train_examples), train_examples[-1].guid

(11314, 11313)

### Word-piece tokenization

Equally important is the tokenization BERT uses, which is based on Word-Piece tokenization.
Roughly, that can be understood as splitting words at syllables, but this word sub-splitting is based on statistical properties of the sub-word sequences, not on phonetic or linguistic properties.
That gives BERT the advantage that it can handle unseen words much better than if using word-level tokenization, as long as the vocabulary covers some of the word-pieces.
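
The following is a minimal sketch of how a word can be split into word-pieces once the vocabulary is fixed, assuming a greedy longest-match-first lookup (the strategy BERT's tokenizer is generally described as using). The function `wordpiece` and the tiny `toy_vocab` are hypothetical and only serve as illustration; the real vocabulary lives in `vocab.txt`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lower-cased word into word-pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"en", "##light", "##en", "car", "##s"}
print(wordpiece("enlighten", toy_vocab))
print(wordpiece("cars", toy_vocab))
print(wordpiece("bricklin", toy_vocab))
```

With this toy vocabulary, "enlighten" is covered by three known pieces, while "bricklin" has no matching piece at all and falls back to the unknown token.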

In [80]:
tokenization.validate_case_matches_checkpoint(
    True,
    BERT_INIT_CHKPNT,
)

tokenizer = tokenization.FullTokenizer(
    vocab_file=BERT_VOCAB,
    do_lower_case=True,
)

### Feature conversion

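With the tokenizer in place, we can convert the examples into so-called feature vectors, which are the input for BERT. Review the output from this cell, as it shows the word-piece tokens and, more generally, what the input to BERT looks like. (The `max_seq_len` value used here is one of the hyper-parameters defined in the model training section below.)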

In [82]:
train_features = run_classifier.convert_examples_to_features(
    train_examples, 
    train.target_names,
    max_seq_len,
    tokenizer,
)

INFO:tensorflow:Writing example 0 of 11314
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 0
INFO:tensorflow:tokens: [CLS] from : le ##r ##x ##st @ wa ##m . um ##d . ed ##u ( where ' s my thing ) subject : what car is this ! ? n ##nt ##p - posting - host : ra ##c ##3 . wa ##m . um ##d . ed ##u organization : university of maryland , college park lines : 15 i was wondering if anyone out there could en ##light ##en me on this car i saw the other day . it was a 2 - door sports car , looked to be from the late 60s / early 70s . it was called a brick ##lin . the doors were really small . in addition , the front bumper was separate from the rest of the body . this is all i know . if anyone can tell ##me a model name , engine spec ##s , years of production , where this car is made , history , or whatever info you have on this funky looking car , please e - mail . thanks , - il - - - - brought to you by your neighborhood le ##r ##x ##st - - - - [SEP]
INFO:tensorflow:input_ids: 101 2013 1
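
To illustrate what such a feature vector contains, here is a simplified sketch for a single-text example: the tokens are truncated, wrapped in `[CLS]` and `[SEP]`, mapped to vocabulary ids, and padded with zeros up to the maximum sequence length. The helper `to_features` is hypothetical, and so are the ids for `:` and `le` in `toy_vocab`; only the overall layout is meant to mirror what the conversion step logs.

```python
def to_features(tokens, vocab, max_seq_len):
    """Build input_ids, input_mask and segment_ids for one single-text example."""
    # Reserve two positions for the special [CLS] and [SEP] tokens.
    tokens = ["[CLS]"] + tokens[: max_seq_len - 2] + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)  # 1 = real token, 0 = padding
    padding = [0] * (max_seq_len - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids = [0] * max_seq_len  # only one text segment (text_a)
    return input_ids, input_mask, segment_ids


# Toy vocabulary: the ids for ":" and "le" are invented for illustration.
toy_vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "from": 2013, ":": 5, "le": 6}

ids, mask, segs = to_features(["from", ":", "le"], toy_vocab, max_seq_len=8)
print(ids)
print(mask)
print(segs)
```

Documents longer than the maximum sequence length are simply cut off in this sketch, which is worth keeping in mind for long newsgroup posts.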

Naturally, we need to apply the *exact same* feature generation procedure to our test dataset:

In [89]:
test_examples = create_examples(
    test.data, 
    test.target,
    test.target_names,
)

In [90]:
test_features = run_classifier.convert_examples_to_features(
    test_examples,
    test.target_names, 
    max_seq_len,
    tokenizer,
)

INFO:tensorflow:Writing example 0 of 7532
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 0
INFO:tensorflow:tokens: [CLS] from : v ##0 ##64 ##mb ##9 ##k @ u ##b ##v ##ms ##d . cc . buffalo . ed ##u ( neil b . gan ##dler ) subject : need info on 88 - 89 bonn ##eville organization : university at buffalo lines : 10 news - software : va ##x / v ##ms v ##ne ##ws 1 . 41 n ##nt ##p - posting - host : u ##b ##v ##ms ##d . cc . buffalo . ed ##u i am a little confused on all of the models of the 88 - 89 bonn ##eville ##s . i have heard of the le se l ##se ss ##e ss ##ei . could someone tell me the differences are far as features or performance . i am also curious to know what the book value is for prefer ##ea ##bly the 89 model . and how much less than book value can you usually get them for . in other words how much are they in demand this time of year . i have heard that the mid - spring early summer is the best time to buy . neil gan ##dler [SEP]
INFO:tensorflow:input_ids: 101 2013 102

## BERT model training

First, we define a few of the hyper-parameters we will use for the fine-tuning of the BERT model.

In [79]:
max_seq_len = 512  # max. n. of word-piece tokens (max=512)
batch_size = 4  # batch size during training (probably memory-constrained)
train_epochs_prop = 2.0  # number/fraction of epochs to train
warmup_prop = 0.1  # fraction of training steps used for learning-rate warmup

# with the above, now calculate the actual n. of training steps to take
num_train_steps = int(len(train_examples) / 
                      batch_size * train_epochs_prop)
num_warmup_steps = int(num_train_steps * warmup_prop)
print(num_train_steps, num_warmup_steps)

5657 565
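
To see what the warmup steps do, here is a rough sketch of the learning rate over the course of training. It assumes a linear warmup from zero followed by a linear decay back to zero, which is how the schedule in BERT's `optimization` module is usually described; treat it as an approximation, not as the library's exact code. The function `learning_rate_at` is hypothetical, and `2e-5` is the learning rate used in this notebook.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup from 0 to init_lr, then linear decay back to 0."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * (1.0 - step / num_train_steps)


num_train_steps = int(11314 / 4 * 2.0)         # 5657, as computed above
num_warmup_steps = int(num_train_steps * 0.1)  # 565

for step in (0, 100, 565, 3000, 5657):
    lr = learning_rate_at(step, 2e-5, num_train_steps, num_warmup_steps)
    print(step, f"{lr:.2e}")
```

The small initial learning rate keeps the first, noisy gradient updates from disturbing the pre-trained weights too much.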


Next, we load the BERT model itself from its configuration file, and limit training to a single GPU device (note that `tf.contrib` only exists in TensorFlow 1.x):

In [83]:
bert_config = modeling.BertConfig.from_json_file(
    BERT_CONFIG
)
gpu_device = tf.contrib.distribute.OneDeviceStrategy(
    "device:GPU:0"
)

Now, we make use of [TensorFlow's Estimator API](https://www.tensorflow.org/guide/estimators) to build and run the models. First, we need to create a "model function", that is, a function that provides the model that the Estimator API should run.

In [84]:
model_fn = run_classifier.model_fn_builder(
    bert_config=bert_config,
    num_labels=len(train.target_names),
    init_checkpoint=BERT_INIT_CHKPNT,
    learning_rate=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False,
    use_one_hot_embeddings=False,
)

Before training can commence, we need to set up a runtime configuration for the Estimator API. We use the TPUEstimator implementation (TPUs are Google's special hardware for processing tensors), so you can (probably) run this notebook on Google Colab (never tested!). We will store the final, trained model in a folder next to this notebook called `trained_model`.

In [85]:
run_config = tf.contrib.tpu.RunConfig(
  cluster=None,
  master=None,
  model_dir="trained_model",
  train_distribute=gpu_device,
  save_checkpoints_steps=99999999,
  tpu_config=tf.contrib.tpu.TPUConfig(
      num_cores_per_replica=None,
      iterations_per_loop=1000,
  ),
)

INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.


In addition, we need to set up the actual Estimator object itself that will manage our trained model.

In [86]:
estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=False,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=batch_size,
      predict_batch_size=1,
)

INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_service': None, '_train_distribute': <tensorflow.contrib.distribute.python.one_device_strategy.OneDeviceStrategy object at 0x7fc8b80f7a20>, '_is_chief': True, '_distribute_coordinator_mode': None, '_log_step_count_steps': None, '_cluster': None, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_save_checkpoints_steps': 99999999, '_num_ps_replicas': 0, '_protocol': None, '_tf_random_seed': None, '_model_dir': 'trained_models', '_num_worker_replicas': 1, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=None, num_cores_per_replica=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_save_checkpoints_secs': None, '_evaluation_master': '', '_task_type': 'worker', '_task_id': 0, '_keep_checkpoint_max': 5, '_devi

To feed training data to the Estimator, we need an input function, similar to the model function; that is, a function that will provide the Estimator with the training data for the model in a specific format and as a specific data type per instance.

In [87]:
input_fn = run_classifier.input_fn_builder(
    train_features,
    max_seq_len,
    is_training=True,
    drop_remainder=True,
)
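
As an illustration of what the `drop_remainder` flag means, here is a plain-Python sketch of batching. The generator `batches` is hypothetical and deliberately ignores the shuffling and repetition that a real training input function adds; it only shows that a final batch smaller than the batch size is discarded when the flag is set.

```python
def batches(items, batch_size, drop_remainder):
    """Yield consecutive batches; optionally drop a final, smaller batch."""
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            return
        yield batch


examples = list(range(11314))  # stand-ins for the 11314 training features
full = list(batches(examples, 4, drop_remainder=True))
everything = list(batches(examples, 4, drop_remainder=False))
print(len(full), len(everything), len(everything[-1]))
```

Fixed-size batches give every tensor a fully known shape, which is why the flag exists in the first place.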

Finally, now that everything is configured and set up, the Estimator API can be used to train the model for the number of steps we established much earlier.

In [88]:
estimator.train(
    input_fn=input_fn,
    max_steps=num_train_steps,
)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (4, 500)
INFO:tensorflow:  name = input_mask, shape = (4, 500)
INFO:tensorflow:  name = label_ids, shape = (4,)
INFO:tensorflow:  name = segment_ids, shape = (4, 500)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (30522, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/l

<tensorflow.contrib.tpu.python.tpu.tpu_estimator.TPUEstimator at 0x7fc8b80f7f28>

## Model Evaluation

Now that we have finished fine-tuning the model on our dataset, so that it generates the classes we want, it is time to test it.

As during training, we need an input function for the Estimator that provides the test data:

In [91]:
input_fn_test = run_classifier.input_fn_builder(
    test_features,
    max_seq_len,
    is_training=False,
    drop_remainder=True,
)

With that, and thanks to the Estimator API, we are immediately ready to run the test cases and make predictions on them.

In [92]:
result = estimator.predict(input_fn=input_fn_test)
# collect the results in a list first: np.array() cannot consume a generator
predictions = np.array([np.argmax(r['probabilities']) for r in result])

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running infer on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (1, 500)
INFO:tensorflow:  name = input_mask, shape = (1, 500)
INFO:tensorflow:  name = label_ids, shape = (1,)
INFO:tensorflow:  name = segment_ids, shape = (1, 500)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (30522, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_0/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/l

In [101]:
test.target, predictions

(array([ 7,  5,  0, ...,  9,  6, 15]), array([ 7,  5,  0, ...,  9, 12, 15]))

We can use these directly with the SciKit-Learn `metrics` API to evaluate the result:

In [102]:
accuracy_score(test.target, predictions)

0.861125862984599

Great! We've built a neural network that learns to classify documents.
Yet, note that it takes *ages* to train compared to the models we trained on day 1: this model takes almost an hour on a regular GPU, versus the few seconds needed to train the traditional models. But it does beat the accuracy of the best models from day 1, at 86% vs. 85%. And if you were to use the BERT-Large model or train more carefully (e.g., including pre-training on the type of documents we see here), you could probably still squeeze out some more performance. As that would be contrary to the off-the-shelf comparison we are trying to do here, this is left as an exercise to the interested reader. That is to say, with sufficient dedication, you should be able to beat Erkan's 2011 SOTA on this dataset of around 90% accuracy using some version of BERT.
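
When comparing accuracies that are only about one percentage point apart, it helps to have a rough idea of the uncertainty of the estimate. The sketch below computes a normal-approximation (Wald) confidence interval for the accuracy measured above. It assumes that the 7532 test documents can be treated as independent samples and uses `z=1.96` for an approximate 95% interval; the function `accuracy_interval` is hypothetical and not part of SciKit-Learn.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# accuracy and test set size as reported in this notebook
low, high = accuracy_interval(0.861125862984599, 7532)
print(f"{low:.3f} - {high:.3f}")
```

An interval of this kind only reflects sampling noise in the test set; it says nothing about run-to-run variation from random initialization or data shuffling during fine-tuning.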

## Conclusion

Overall, this notebook is mostly here to show you why you should focus on simple things first: If you cannot produce something meaningful with an old-school model, that is either because

1. You do not have sufficient training data and need to make use of modern transfer learning.
2. Your dataset is not adequate to solve the problem at hand; e.g., overly noisy data or annotations are a pretty common issue.

(And if you are not sure what problem it is, you should assume it is the latter.)
You will save yourself lots of time and pain if you first start with the simple models we saw on day one to evaluate how far you will be able to go with the data you have.

Overall: Can one beat the state-of-the-art in text classification with deep learning? Yes, by using the very latest transfer learning techniques from Google, AllenAI, and others. But even then, remember: It is still tricky, and requires significant (hardware) resources.

## Take-home message

In the opinion of your instructor, the deep learning literature in particular is *littered* with evaluation results that claim to beat all former state of the art, but in fact are quite frequently not much better, or on par (ex aequo), and often even worse. That typically happens because the evaluation conditions are poorly chosen and do not match the earlier literature.

Computer vision (CV), machine translation (MT), and dependency parsing (DP) are the famous cases where deep learning has indeed "pushed the envelope" by a substantial margin *on the **same**, **public** (and often, small) community datasets* used for evaluating the approach and comparing it to existing methods. Yet almost no paper discusses how many more resources go into setting up, developing, training, and using deep learning models. And, at least for NLP and text mining, the gap between current deep learning models and all the other, equally important approaches is nowhere near as wide as what we saw in CV and MT.