# 03-build-model-basic

## Main objectives:

- Use the default training config recommended by spaCy 3.0 for a text classification model
    - [https://nightly.spacy.io/usage/training#quickstart](https://nightly.spacy.io/usage/training#quickstart)
- Use spaCy-generated docs (`train.spacy` and `dev.spacy`) to train this model
- Run basic validation and evaluation

In [1]:
import spacy
# Load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

In [2]:
!ls data

dev.spacy  train.spacy


## Fill in the config based on our base_config generated from spacy.io

In [3]:
!python -m spacy init fill-config ./config/base_config-basic.cfg ./config/config-basic.cfg

✔ Auto-filled config with all values
✔ Saved config
config/config-basic.cfg
You can now add your data and train your pipeline:
python -m spacy train config-basic.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
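
The hint printed by `init fill-config` passes the corpus locations as `--paths.train` and `--paths.dev` overrides. A minimal sketch of assembling that command from Python, assuming the `data/` files listed earlier and the `--output` directory used for training in this notebook. `build_train_command` is a hypothetical helper, not part of spaCy, and it only builds the argument list without running anything:

```python
import shlex
import sys


def build_train_command(config_path, output_dir, train_path=None, dev_path=None):
    """Assemble the argument list for `python -m spacy train` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "spacy", "train", config_path, "--output", output_dir]
    # Dotted flags such as --paths.train are passed through as config overrides.
    if train_path is not None:
        cmd += ["--paths.train", train_path]
    if dev_path is not None:
        cmd += ["--paths.dev", dev_path]
    return cmd


cmd = build_train_command(
    "./config/config-basic.cfg",
    "./models/basic",
    train_path="./data/train.spacy",
    dev_path="./data/dev.spacy",
)
# Print the command for inspection (interpreter path omitted for readability).
print("python " + shlex.join(cmd[1:]))
```

To actually launch training, the list could be handed to `subprocess.run(cmd, check=True)`. If the paths are already set in the config file, the two overrides can be left out.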


## Validate the configuration file

In [4]:
# validate configuration
!python -m spacy debug config ./config/config-basic.cfg

✔ Config is valid


In [5]:
!python -m spacy debug data ./config/config-basic.cfg

✔ Corpus is loadable
✔ Pipeline can be initialized with data

Language: en
Training pipeline: tok2vec, textcat
5000 training docs
5000 evaluation docs
✔ No overlap between training and evaluation data

ℹ 282161 total word(s) in the data (18163 unique)
ℹ No word vectors present in the package

✔ 3 checks passed


## Train the model

In [6]:
!python -m spacy train ./config/config-basic.cfg --output ./models/basic

ℹ Using CPU

[2021-03-22 23:04:34,058] [INFO] Set up nlp object from config
[2021-03-22 23:04:34,623] [INFO] Pipeline: ['tok2vec', 'textcat']
[2021-03-22 23:04:34,629] [INFO] Created vocabulary
[2021-03-22 23:04:34,629] [INFO] Finished initializing nlp object
[2021-03-22 23:04:51,552] [INFO] Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.23
wandb: Syncing run peach-lion-21
wandb: ⭐️ View project at https://wandb.ai/garza/yelp-polarity
wandb: 🚀 View run at https://wandb.ai/garza/yelp-polarity/runs/1upmd5tz
wandb: Run data is saved locally in /home/garza/notebooks/aicamp/yelp-project/wandb/run-20210322

## Evaluate the best model and save metrics to disk

In [7]:
!python -m spacy evaluate ./models/basic/model-best ./data/dev.spacy --output ./evaluate/basic-metrics.json

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   94.57
SPEED                 27802

               P       R       F
positive   88.20   87.28   87.74
negative   87.41   88.32   87.86

           ROC AUC
positive      0.95
negative      0.95

✔ Saved results to evaluate/basic-metrics.json

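As a sanity check on the report above, the F column can be recomputed from the P and R columns as their harmonic mean. A small sketch using only the numbers printed in the table; the macro average at the end is an extra figure computed here for illustration, not something the CLI printed:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-label precision and recall copied from the evaluation report (percent).
report = {
    "positive": (88.20, 87.28),
    "negative": (87.41, 88.32),
}

per_label_f = {label: f1(p, r) for label, (p, r) in report.items()}
# Unweighted mean over labels (macro average); not part of the printed report.
macro_f = sum(per_label_f.values()) / len(per_label_f)

for label, score in per_label_f.items():
    print(f"{label}: F = {score:.2f}")
print(f"macro F = {macro_f:.2f}")
```

The recomputed values agree with the reported 87.74 and 87.86 to two decimals, even though the inputs were already rounded. The two labels score almost identically, so the model does not appear to favor one class over the other on the dev set.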

In [8]:
!python -m spacy debug data ./config/config-basic.cfg

✔ Corpus is loadable
✔ Pipeline can be initialized with data

Language: en
Training pipeline: tok2vec, textcat
5000 training docs
5000 evaluation docs
✔ No overlap between training and evaluation data

ℹ 282161 total word(s) in the data (18163 unique)
ℹ No word vectors present in the package

✔ 3 checks passed
