Training a document classifier with spaCy's built-in text categorizer (a CNN model). The code is adapted from the text categorizer example in the spaCy docs.

In [1]:
import spacy
from spacy.util import minibatch, compounding
import json
from pathlib import Path
import pandas as pd 
from itertools import islice 
from sklearn import metrics
from collections import Counter

In [2]:
DATA_DIR = Path('../../data/wiki10')
TEXT_DIR = DATA_DIR / 'text' 
CLF_LABELS_PATH = DATA_DIR / 'clf0-singlelabel.csv'
MODEL_DIR = DATA_DIR / 'model' 

In [3]:
def load_data(n_texts, split=0.7):
    """Load up to n_texts shuffled documents and split them into train/test sets."""
    # load labels (CLF_LABELS_PATH is defined with the other paths above)
    df = pd.read_csv(CLF_LABELS_PATH).sample(frac=1, random_state=0) # load and shuffle records
    df = df[:n_texts] 
    labels = df.tag.unique()
    # one dict per document: True for its own tag, False for every other label
    cats = [{label.upper(): label == tag for label in labels} for tag in df.tag]
    # load texts
    texts = [TEXT_DIR.joinpath(doc_id).read_text() for doc_id in df.id]
    # split into train/test sets 
    n_train = int(len(df) * split)
    return (texts[:n_train], cats[:n_train]), (texts[n_train:], cats[n_train:])
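
To make the `cats` format concrete, here is a small sketch of the dictionaries built for each document. The tag values are made up for illustration and are not read from the labels file.

```python
# Toy tags standing in for the `tag` column of the labels CSV (hypothetical values).
tags = ["music", "food", "music", "math"]

# Unique labels in order of first appearance, similar to df.tag.unique().
labels = list(dict.fromkeys(tags))

# One dict per document: True for the document's own tag, False for every other label.
cats = [{label.upper(): label == tag for label in labels} for tag in tags]

print(cats[1])  # {'MUSIC': False, 'FOOD': True, 'MATH': False}
```

Each dictionary has exactly one True entry, which is what makes this a single-label problem even though the format allows several.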

In [4]:
def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    
    for i, doc in enumerate(textcat.pipe(docs, n_threads=1, batch_size=20)): # fixing n_threads to avoid memory error
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

In [8]:
def main(nlp, n_iter, output_dir, n_texts=2000):
    # add the text classifier to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe('textcat')

    # add labels to text classifier
    textcat.add_label('MUSIC')
    textcat.add_label('FOOD')
    textcat.add_label('RELIGION')
    textcat.add_label('SOFTWARE')
    textcat.add_label('POLITICS')
    textcat.add_label('MATH')
    
    print("Loading data...")
    (train_texts, train_cats), (test_texts, test_cats) = load_data(n_texts=n_texts)
    print("Loaded examples ({} training, {} evaluation)"
          .format(len(train_texts), len(test_texts)))
    train_data = list(zip(train_texts,
                          [{'cats': cats} for cats in train_cats]))
    
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        print("Training the model...")
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(train_data, size=compounding(4., 32., 1.001))
            for i_batch, batch in enumerate(batches):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data
            true_labels, pred_labels = get_labels(nlp.tokenizer, textcat, test_texts, test_cats)
            print(metrics.classification_report(true_labels, pred_labels))
            print(f"LOSS: {losses['textcat']:.3f}")
            
    # test the trained model
    #test_text = "Grapes are red or green"
    #doc = nlp(test_text)
    #print(test_text, doc.cats)
    
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)
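
The batching step in `main` can be sketched in plain Python. This is a simplified re-implementation based on my understanding of `spacy.util.minibatch` and `spacy.util.compounding`, not the library source; `compounding_sketch` and `minibatch_sketch` are hypothetical names.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

def minibatch_sketch(items, size):
    """Cut items into consecutive batches whose sizes come from the `size` generator."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(size))))
        if not batch:
            break
        yield batch

train_data = list(range(1260))  # stand-in for the 1260 training examples
sizes = compounding_sketch(4., 32., 1.001)
batch_sizes = [len(batch) for batch in minibatch_sketch(train_data, sizes)]
print(batch_sizes[0], max(batch_sizes), len(batch_sizes))
```

With a compound rate of 1.001 the batch size stays close to its starting value of 4 within a single pass over 1260 examples. The sketch also contains no random step: the same input order always yields the same batches.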

In [10]:
%%time
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner']) 
main(nlp, n_iter=2, output_dir=MODEL_DIR, n_texts=1800)

Loading data...
Loaded examples (1260 training, 540 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
140.330	0.942	0.757	0.840
70.210	0.955	0.746	0.838
Saved model to ../../data/wiki10/model
CPU times: user 15min 3s, sys: 8.12 s, total: 15min 11s
Wall time: 15min 35s


In [12]:
%%time
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner']) 
main(nlp, n_iter=2, output_dir=MODEL_DIR, n_texts=1800)

Loading data...
Loaded examples (1260 training, 540 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
196.336	0.962	0.426	0.591
158.882	0.959	0.433	0.597
Saved model to ../../data/wiki10/model
CPU times: user 15min 1s, sys: 6.73 s, total: 15min 8s
Wall time: 15min 36s


Simply re-running the same training can produce very different performance. Since the random_state of the training sample is fixed, the only thing (I think!) that could make the result non-deterministic is the batching used to train the model. Will investigate this.
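
One way to start that investigation is to fix the random number generators before each run and check whether the results stop varying. A minimal sketch, assuming the variation comes from unseeded generators in Python's `random` module and in NumPy (for example in weight initialisation or dropout); this is an assumption and has not been verified here, and `fix_seeds` is a hypothetical helper.

```python
import random

import numpy as np

def fix_seeds(seed=0):
    """Seed the Python and NumPy random number generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)

# Drawing after re-seeding gives the same values, so two runs would start alike.
fix_seeds(0)
first = (random.random(), float(np.random.rand()))
fix_seeds(0)
second = (random.random(), float(np.random.rand()))
print(first == second)  # True
```

Calling such a helper at the top of each training cell would show whether seeding alone is enough to make the runs repeatable.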

Also note the gap between precision and recall, which is likely due to the evaluation process. It is set up to handle multilabel data (n-hot encoded target), but we have a single label for each document. The metric treats each label of each document as a separate outcome, so with our 6 categories a correct prediction for a single document is actually scored as 6 correct outcomes (assuming the probability for the true class is >= 0.5 and the probability for every other class is < 0.5).
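
The per-label counting can be sketched on a single document. The scores are made up for illustration, and `count_outcomes` is a hypothetical helper that mirrors the thresholding in `evaluate`.

```python
LABELS = ["MUSIC", "FOOD", "RELIGION", "SOFTWARE", "POLITICS", "MATH"]

def count_outcomes(scores, gold, threshold=0.5):
    """Count one outcome per label, as the multilabel-style evaluation does."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for label, score in scores.items():
        predicted = score >= threshold
        actual = gold[label]
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# One document whose true label is MATH.
gold = {label: label == "MATH" for label in LABELS}
confident = {label: 0.9 if label == "MATH" else 0.1 for label in LABELS}
hesitant = {label: 0.4 if label == "MATH" else 0.2 for label in LABELS}

print(count_outcomes(confident, gold))  # {'tp': 1, 'fp': 0, 'fn': 0, 'tn': 5}
print(count_outcomes(hesitant, gold))   # {'tp': 0, 'fp': 0, 'fn': 1, 'tn': 5}
```

In the second case the top-scoring label is still the right one, yet the document only adds a false negative. If many documents look like this, recall drops while precision stays high, which would be consistent with the numbers above.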

The evaluation is changed below to take the highest-scoring label as each document's prediction and to report metrics per class.

In [6]:
def get_labels(tokenizer, textcat, texts, cats):
    ''' return lists of true and predicted labels, as expected by sklearn.metrics.classification_report '''
    docs = (tokenizer(text) for text in texts) # tokenizing/textcat outside of pipeline to allow custom weights (moving average)
    
    true_labels = [max(cat_dict, key=cat_dict.get) for cat_dict in cats]
    pred_labels = [max(doc.cats, key=doc.cats.get) for doc in textcat.pipe(docs, n_threads=1, batch_size=20)]
    
    return true_labels, pred_labels
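
The label selection in `get_labels` relies on `max` with a dictionary's `get` method as the key. A short sketch with made-up scores:

```python
# Made-up scores and gold annotations for one document.
doc_cats = {"MUSIC": 0.30, "FOOD": 0.05, "MATH": 0.45}
gold_cats = {"MUSIC": False, "FOOD": False, "MATH": True}

# max(..., key=d.get) returns the key with the largest value.
pred_label = max(doc_cats, key=doc_cats.get)
# For the gold dict this picks the single True entry, because True > False.
true_label = max(gold_cats, key=gold_cats.get)

print(pred_label, true_label)  # MATH MATH
```

Note that the prediction here is MATH even though no score reaches 0.5, so every document receives exactly one predicted label.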

In [7]:
%%time
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner']) 
main(nlp, n_iter=2, output_dir=MODEL_DIR, n_texts=1800)

Loading data...
Loaded examples (1260 training, 540 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  


  'precision', 'predicted', average, warn_for)


             precision    recall  f1-score   support

       FOOD       0.00      0.00      0.00        93
       MATH       0.80      0.97      0.88        79
      MUSIC       0.46      0.96      0.62        99
   POLITICS       0.00      0.00      0.00        88
   RELIGION       0.72      0.98      0.83        95
   SOFTWARE       0.77      0.95      0.85        86

avg / total       0.45      0.64      0.52       540

LOSS: 118.999
Saved model to ../../data/wiki10/model
CPU times: user 14min 1s, sys: 11.2 s, total: 14min 12s
Wall time: 14min 47s


In [9]:
%%time
nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner']) 
main(nlp, n_iter=2, output_dir=MODEL_DIR, n_texts=1800)

Loading data...
Loaded examples (1260 training, 540 evaluation)
Training the model...


  'precision', 'predicted', average, warn_for)


             precision    recall  f1-score   support

       FOOD       0.00      0.00      0.00        93
       MATH       0.76      0.99      0.86        79
      MUSIC       0.64      0.93      0.76        99
   POLITICS       0.92      0.95      0.94        88
   RELIGION       0.94      0.95      0.94        95
   SOFTWARE       0.76      0.94      0.84        86

avg / total       0.66      0.79      0.72       540

LOSS: 87.131
Saved model to ../../data/wiki10/model
CPU times: user 13min 59s, sys: 6.75 s, total: 14min 6s
Wall time: 14min 36s


Performance is still unstable: in some runs a category is never predicted at all (FOOD in both runs above, POLITICS in the first). Will need to investigate this further. Perhaps it is due to the (incorrect?) loss function in our multi-label model: with 6 classes, the correct class contributes only 1/6 of the overall loss.
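
The concern about the loss can be illustrated with a toy calculation. This sketch assumes a per-label squared error summed over the six outputs, which is an assumption about the loss and has not been checked against the spaCy source; `squared_error` is a hypothetical helper.

```python
def squared_error(scores, truths):
    """Sum of squared differences over all labels (assumed loss, for illustration)."""
    return sum((s - t) ** 2 for s, t in zip(scores, truths))

# Six labels, the last one is the true class.
truths = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

correct = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
predict_nothing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
confident_wrong = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(squared_error(correct, truths))          # 0.0
print(squared_error(predict_nothing, truths))  # 1.0
print(squared_error(confident_wrong, truths))  # 2.0
```

Under this assumed loss, predicting no label at all already gets five of the six terms right and costs half as much as a confident wrong guess, so a model could settle on never predicting some categories.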