# 06-build-model-TextCatEnsemble

## Main objectives:

- Use spaCy's TextCatEnsemble to train a stacked ensemble of a linear bag-of-words model and a neural network model.
    - [https://spacy.io/api/architectures#TextCatEnsemble](https://spacy.io/api/architectures#TextCatEnsemble)
- Use spaCy-generated docs to train this model
- Run basic validation and evaluation

In [41]:
import spacy
# Load an English language model in spaCy
nlp = spacy.load("en_core_web_lg")

In [1]:
!ls data

dev.spacy  train.spacy


# Validate configuration file

In [2]:
# validate configuration
!python -m spacy debug config ./config/config-TextCatEnsemble.cfg

2021-03-16 22:57:29.042865: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Config is valid


# Train the model

In [3]:
!python -m spacy train ./config/config-TextCatEnsemble.cfg --output ./models/textCatEnsemble

2021-03-16 22:57:30.912176: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Using CPU
[2021-03-16 22:57:31,952] [INFO] Set up nlp object from config
[2021-03-16 22:57:31,962] [INFO] Pipeline: ['tok2vec', 'textcat']
[2021-03-16 22:57:31,966] [INFO] Created vocabulary
[2021-03-16 22:57:31,966] [INFO] Finished initializing nlp object
[2021-03-16 22:57:39,291] [INFO] Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.00        0.32    0.00
  0     100          0.00         16.43       76.85    0.77
  0     200          0.00         12.06       25.84    0.26
  0     300          0.00         14.37       76.02    0.7

# Evaluate our best model output and save metrics to disk

Hyperparameter tuning covered different BOW attributes and TextCatEnsemble parameters:

- ngram_size = 4 (TODO: try again with ngram_size = 2 after doing BOW analysis)
- adjusted width to 128, with only a nominal performance gain
- tok2vec model embed attributes modified to include:
    - "ORTH", "LOWER", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ID"
    - [2000, 2000, 1000, 1000, 1000, 1000]
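As listed, the attributes above number seven while the row counts number six. Assuming the embedding layer expects exactly one row count per attribute (the config file itself is not shown here, so this is an assumption), a quick check along the following lines would flag the mismatch before training. The helper name `pair_attrs_with_rows` is hypothetical.

```python
# Values copied from the list above. The one-row-count-per-attribute rule
# is an assumption about how the embed layer is configured.
attrs = ["ORTH", "LOWER", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ID"]
rows = [2000, 2000, 1000, 1000, 1000, 1000]


def pair_attrs_with_rows(attrs, rows):
    """Return {attribute: row_count}; raise if the two lists differ in length."""
    if len(attrs) != len(rows):
        raise ValueError(
            f"Mismatched lengths: {len(attrs)} attrs vs {len(rows)} rows"
        )
    return dict(zip(attrs, rows))


try:
    table_sizes = pair_attrs_with_rows(attrs, rows)
except ValueError as err:
    # Reached with the values above, because 7 != 6.
    table_sizes = None
    mismatch = str(err)
    print(mismatch)
```

If the check fails, either an attribute needs to be dropped or a row count added so that the two lists line up.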

The width was also adjusted from 64 to 96, which resulted in only a nominal increase in performance.

Training was tested on training sets of 100, 500, 1000, and finally 5000.
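The notes do not say how the smaller training sets were drawn. One hypothetical way to build them, sketched below, is to shuffle once with a fixed seed and take prefixes, so that each smaller set is contained in the next larger one. The function name `nested_subsets` and the integer stand-ins for examples are illustrative only.

```python
import random


def nested_subsets(examples, sizes, seed=0):
    """Shuffle once, then take prefixes so smaller sets nest inside larger ones."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Skip any requested size that exceeds the available data.
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}


# Stand-in for 5000 training examples (hypothetical ids, not the real data).
examples = list(range(5000))
subsets = nested_subsets(examples, [100, 500, 1000, 5000])
```

Nesting the subsets keeps the comparison between sizes fair, since every larger run sees everything the smaller run saw plus more.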

In [4]:
!python -m spacy evaluate ./models/textCatEnsemble/model-best ./data/dev.spacy --output ./evaluate/model-textCatEnsemble-metrics.json

2021-03-16 23:08:20.469265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Using CPU

TOK                 100.00
TEXTCAT (macro F)   80.64
SPEED               29524

                P       R       F
positive   100.00   67.56   80.64

           ROC AUC
positive      0.89

✔ Saved results to evaluate/model-textCatEnsemble-metrics.json
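The per-label F value in the table can be reproduced from the reported precision and recall, assuming it is the usual harmonic mean of the two. The sketch below does that arithmetic; the function name `f_score` is illustrative and not part of spaCy.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R for the "positive" label, taken from the evaluation output above.
positive_f = f_score(100.00, 67.56)
print(round(positive_f, 2))
```

A precision of 100 with a recall of 67.56 means the model produced no false positives on the dev set but missed roughly a third of the positive examples.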
