# 05-build-model-TextCatCNN

## Main objectives:

- Use spaCy's neural network model where token vectors are calculated using a CNN
    - [https://spacy.io/api/architectures#TextCatCNN](https://spacy.io/api/architectures#TextCatCNN)
- Use spaCy generated docs to train this model
- Run basic validation and evaluation 

In [1]:
import spacy
# Load an English language model in spaCy
nlp = spacy.load("en_core_web_lg")

In [2]:
!ls data

dev.spacy  train.spacy


# Validate configuration file

In [3]:
# validate configuration
!python -m spacy debug config ./config/config-TextCatCNN.cfg

2021-03-16 22:56:35.819839: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Config is valid


# Train the model

In [1]:
!python -m spacy train ./config/config-TextCatCNN.cfg --output ./models/textCatCNN

2021-03-16 23:23:40.099463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Using CPU

[2021-03-16 23:23:40,999] [INFO] Set up nlp object from config
[2021-03-16 23:23:41,007] [INFO] Pipeline: ['textcat']
[2021-03-16 23:23:41,010] [INFO] Created vocabulary
[2021-03-16 23:23:41,010] [INFO] Finished initializing nlp object
[2021-03-16 23:23:44,661] [INFO] Initialized pipeline components: ['textcat']
✔ Initialized pipeline

ℹ Pipeline: ['textcat']
ℹ Initial learn rate: 0.001
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ----------  ------
  0       0          0.10       49.27    0.49
  0     100         23.88       46.77    0.47
  0     200         24.35       55.04    0.55
  0     300         26.19       51.92    0.52
  0     400         19.42       51.91    0.52
  0     500         20.95       51.92    0.52
  0     600         15.

# Evaluate our best model output and save metrics to disk

For hyperparameter tuning, we experimented with different tok2vec attributes and row sizes, settling on:

- Attributes: "NORM", "PREFIX", "SUFFIX", "SHAPE"
- Rows: [10000, 5000, 5000, 5000]

We also increased the width from 64 to 96, which yielded only a marginal improvement in performance.

Training was tested on training sets of size 100, 500, 1000 and, finally, 5000.
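
To get a feel for what the width change costs in model size, here is a rough sketch. It assumes the attributes and row counts above configure a MultiHashEmbed-style embedding layer with one hash table of shape (rows, width) per attribute. The function name `embedding_table_params` is purely illustrative, and the count ignores the CNN encoder and the output layer.

```python
# Settings reported in the notes above. The embedding layout (one table per
# attribute, each of shape rows x width) is an assumption for illustration.
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [10000, 5000, 5000, 5000]


def embedding_table_params(rows, width):
    """Count parameters in the hash embedding tables only."""
    return sum(n_rows * width for n_rows in rows)


# Each attribute needs exactly one row count.
assert len(attrs) == len(rows)

for width in (64, 96):
    n_params = embedding_table_params(rows, width)
    print(f"width={width}: {n_params:,} embedding-table parameters")
```

Under these assumptions the embedding tables grow from 1,600,000 to 2,400,000 parameters, a 50% increase for what the notes describe as a small gain.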

In [2]:
!python -m spacy evaluate ./models/textCatCNN/model-best ./data/dev.spacy --output ./evaluate/model-textCatCNN-metrics.json

2021-03-16 23:43:03.898740: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   91.81
SPEED                 38864

               P       R       F
positive   87.12   82.28   84.63

           ROC AUC
positive      0.92

✔ Saved results to evaluate/model-textCatCNN-metrics.json
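
The saved metrics can be read back for later comparison between models. The sketch below is a minimal example under two assumptions: that the JSON file stores per-label scores under a `cats_f_per_type` key, and that the values are fractions rather than percentages. The helper names `per_label_prf` and `f_score` are hypothetical. It also recomputes F from the P and R values in the table above as a sanity check.

```python
import json
from pathlib import Path


def per_label_prf(metrics_path):
    """Return {label: (P, R, F)} as percentages from a metrics JSON file."""
    metrics = json.loads(Path(metrics_path).read_text(encoding="utf8"))
    # Assumed layout: "cats_f_per_type" maps label -> {"p", "r", "f"} fractions.
    per_type = metrics.get("cats_f_per_type", {})
    return {
        label: tuple(round(100 * scores[key], 2) for key in ("p", "r", "f"))
        for label, scores in per_type.items()
    }


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sanity check against the reported table: P=87.12, R=82.28.
print(round(f_score(87.12, 82.28), 2))  # 84.63

# Only read the file if the evaluate step has already produced it.
metrics_file = Path("./evaluate/model-textCatCNN-metrics.json")
if metrics_file.exists():
    print(per_label_prf(metrics_file))
```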
