# Tutorial on training Engagement Analyzer using the spaCy spancat model

**Authors**: Anonymized

**LastUpdate**: 



# Overview

This step-by-step tutorial walks through training a toy version of Engagement Analyzer (Eguchi & Kyle, 2023) with the spaCy spancat component.

**This tutorial is intended to be run on Google Colaboratory.**

# Setting up the Colab environment

The following code checks which Graphics Processing Unit (GPU) is being used in the current session.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu May 30 01:49:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
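
The `!nvidia-smi` syntax only works inside a notebook cell. A minimal sketch of the same check in plain Python follows, using only the standard library. The helper name `describe_gpu` is hypothetical and is not part of the tutorial repository.

```python
import shutil
import subprocess


def describe_gpu() -> str:
    """Return the nvidia-smi report, or a short message if no GPU is visible."""
    # nvidia-smi is absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code means the driver could not reach a GPU.
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


print(describe_gpu())
```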
                                                                    

In [2]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


## Mounting GoogleDrive

To run the code, you will need to grant the current Google Colab notebook access to your Google Drive. This can be done by running the following code.

In [3]:
# Mount Google Drive
from google.colab import drive # import drive from google colab

ROOT = "drive"     # default location for the drive
print(ROOT)                 # print content of ROOT (Optional)

drive.mount(ROOT)

drive
Mounted at drive


## Changing the directory

- Make sure that you have cloned the GitHub repository into the folder "Colab Notebooks" under MyDrive.
- Run the following code to change the working directory to the engagement-analyzer-train folder.

In [4]:
cd /content/drive/MyDrive/'Colab Notebooks'/engagement-analyzer-train-E867

/content/drive/MyDrive/Colab Notebooks/engagement-analyzer-train
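
The `spacy project run` commands used later in this tutorial look for `project.yml` in the current working directory, so it can help to verify the directory change. The following is a minimal sketch under that assumption; the helper name `enter_project` is hypothetical.

```python
import os
from pathlib import Path


def enter_project(project_dir: str) -> Path:
    """Change into project_dir after confirming that it contains project.yml."""
    path = Path(project_dir).expanduser().resolve()
    if not (path / "project.yml").is_file():
        raise FileNotFoundError(f"No project.yml found in {path}")
    os.chdir(path)
    return path


# Example call, using the directory printed by the cell above:
# enter_project("/content/drive/MyDrive/Colab Notebooks/engagement-analyzer-train")
```

Raising an error early makes a wrong working directory visible before any long-running command is started.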


### (Optional step) Setting up wandb package for hyperparameter search

In [None]:
# OPTIONAL: If you want to track the ML experiment results using wandb, install it and log in.
# This is useful when you train multiple models and want to compare the results side by side later.
# !pip install wandb
# !wandb login

## Installing necessary packages

To run this tutorial successfully, you will first need to install the spaCy package in the Colab environment, overwriting the preinstalled version.
This can be done by running the following code.

In [5]:
!pip install --upgrade pip setuptools wheel


Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
Collecting setuptools
  Using cached setuptools-70.0.0-py3-none-any.whl (863 kB)
Installing collected packages: setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 67.7.2
    Uninstalling setuptools-67.7.2:
      Successfully uninstalled setuptools-67.7.2
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
Successfully installed pip-24.0 setuptools-70.0.0


In [6]:
!pip3 uninstall spacy-curated-transformers


In [7]:
!pip3 install 'spacy==3.4.4' 'spacy-experimental==0.6.1'  'spacy-transformers==1.1.7' 'transformers==4.20.1' 'torch==1.12.1' torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
Collecting spacy==3.4.4
  Downloading spacy-3.4.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Collecting spacy-experimental==0.6.1
  Downloading spacy_experimental-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting spacy-transformers==1.1.7
  Downloading spacy_transformers-1.1.7-py2.py3-none-any.whl.metadata (6.0 kB)
Collecting transformers==4.20.1
  Downloading transformers-4.20.1-py3-none-any.whl.metadata (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.3/77.3 kB 3.6 MB/s eta 0:00:00
Collecting torch==1.12.1
  Downloading https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl (1904.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 GB 844.8 kB/s eta 0:00:00
Collecting thinc<8.2.0,>=8.1.0 (from 

The torch version pinned above is compatible with the Engagement Analyzer. The following cell prints the installed CuPy version.

In [8]:
import cupy
print(cupy.__version__)


12.2.0


You can check if the install was successful by running the following.

In [9]:
!pip list -v

Package                          Version               Location                                Installer
-------------------------------- --------------------- --------------------------------------- ---------
absl-py                          1.4.0                 /usr/local/lib/python3.10/dist-packages pip
aiohttp                          3.9.5                 /usr/local/lib/python3.10/dist-packages pip
aiosignal                        1.3.1                 /usr/local/lib/python3.10/dist-packages pip
alabaster                        0.7.16                /usr/local/lib/python3.10/dist-packages pip
albumentations                   1.3.1                 /usr/local/lib/python3.10/dist-packages pip
altair                           4.2.2                 /usr/local/lib/python3.10/dist-packages pip
annotated-types                  0.7.0                 /usr/local/lib/python3.10/dist-packages pip
anyio                            3.7.1                 /usr/local/lib/python3.10/dist-packages pi
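
Scanning the full `pip list` output by eye is error-prone. As a minimal sketch, the pinned versions from the install cell above can be compared against what is actually installed using the standard library's `importlib.metadata`. The names `PINNED` and `check_pins` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins copied from the install cell above.
PINNED = {
    "spacy": "3.4.4",
    "spacy-experimental": "0.6.1",
    "spacy-transformers": "1.1.7",
    "transformers": "4.20.1",
    "torch": "1.12.1",
}


def check_pins(pins: dict) -> dict:
    """Map each package to 'ok', 'missing', or the mismatching installed version."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build tags such as '+cu116' are ignored in the comparison.
        report[name] = "ok" if installed.split("+")[0] == wanted else installed
    return report


for name, status in check_pins(PINNED).items():
    print(f"{name:20s} {status}")
```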

# Setting up the spaCy project by running the "install" command

- The spaCy package provides project commands for setting up the training environment.
- Run the following command to perform this setup.

In [10]:
!python -m spacy project run install

2024-05-30 01:54:01.499571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-30 01:54:01.499635: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-30 01:54:01.622484: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-30 01:54:01.857793: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-30 01:54:06.893355: I external/local_
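
Outside a notebook, the project commands in this tutorial can be chained from Python. The sketch below builds the same `python -m spacy project run <name>` invocation and stops at the first failing step. The helper names `project_run_cmd` and `run_steps` are hypothetical; the command names are the ones used in this tutorial.

```python
import subprocess
import sys
from typing import List


def project_run_cmd(name: str) -> List[str]:
    """Build the argument list for `python -m spacy project run <name>`."""
    return [sys.executable, "-m", "spacy", "project", "run", name]


def run_steps(names: List[str]) -> None:
    """Run project commands in order and stop at the first failure."""
    for name in names:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(project_run_cmd(name), check=True)


# Command names as used in this tutorial:
# run_steps(["install", "preprocess_engagementv3", "spancat"])
print(" ".join(project_run_cmd("install")))
```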

In [None]:
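# OPTIONAL: PyTorch CUDA allocator settings.
# `!export` runs in a subshell and does not persist, so the %env magic is used instead.
#%env PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
#%env PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.8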

## Installing two spaCy off-the-shelf models

We will first download the following two off-the-shelf spaCy models.
- `en_core_web_lg` = used to train the baseline model with static word vectors.
- `en_core_web_trf` = used to train the dual-transformer model and to run the dependency parser for the subtree candidate generator.

In [11]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [12]:
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m457.4/457.4 MB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
Collecting spacy-curated-transformers<0.3.0,>=0.2.0 (from en-core-web-trf==3.7.3)
  Downloading spacy_curated_transformers-0.2.2-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<0.3.0,>=0.2.0->en-core-web-trf==3.7.3)
  Downloading curated_tokenizers-0.0.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.2.2-py2.py3-none-any.whl (236 kB)
   ━━━━━━━━━━
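
Each downloaded model is installed as a regular Python package, so its presence can be checked without loading it. The following is a minimal sketch using the standard library; the helper name `installed_models` is hypothetical. In a running notebook, a freshly installed model may only become visible after restarting the runtime, as the download output notes.

```python
from importlib.util import find_spec

MODELS = ["en_core_web_lg", "en_core_web_trf"]


def installed_models(names):
    """Return {model_name: True/False} depending on whether the package is importable."""
    return {name: find_spec(name) is not None for name in names}


for name, ok in installed_models(MODELS).items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```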

# Data preprocessing

The following command converts the data from IOB format into the binary .spacy files used for training.

Input = data in IOB format

Output = `train.spacy`, `dev.spacy`, and `test.spacy`

In [13]:
!python -m spacy project run preprocess_engagementv3


ℹ Skipping 'preprocess_engagementv3': nothing changed
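
As a rough illustration of what IOB input encodes, the sketch below converts token-level IOB tags into (start, end, label) spans in plain Python. It is not the project's actual preprocessing script, which is not shown in this tutorial. The function name `iob_to_spans`, the example sentence, and the labels `ENTERTAIN` and `COUNTER` are illustrative assumptions.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert token-level IOB tags to (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open span on O, on a new B- tag, or when an I- tag changes label.
        if tag == "O" or tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if label is not None:
                spans.append((start, i, label))
            start, label = None, None
        # Open a span on B-, or on a stray I- tag that has no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["It", "may", "be", "true", ",", "but", "I", "disagree", "."]
tags = ["O", "B-ENTERTAIN", "I-ENTERTAIN", "O", "O", "B-COUNTER", "O", "O", "O"]
for start, end, label in iob_to_spans(tags):
    print(label, tokens[start:end])
```

In a spaCy pipeline, spans like these would typically be wrapped in `Span` objects, stored under `doc.spans["sc"]` (the spans key passed to the training command in the next section), and serialized with `DocBin`.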


# Training the model

Training can be initiated by running the following code.

In [14]:
!python -m spacy project run spancat

ℹ Running workflow 'spancat'

Running command: /usr/bin/python3 -m spacy train configs/subtree/lg.cfg --output training/spancat/engagement_spl/lg_subtree/ --paths.train data/engagement_spl_train.spacy --paths.dev data/engagement_spl_dev.spacy --gpu-id 0 --vars.spans_key sc -c ./scripts/custom_functions.py
ℹ Saving to output directory:
training/spancat/engagement_spl/lg_subtree
ℹ Using GPU: 0

✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ------------  ----------  ----------  ----------  ------
  0       0          0.00       4770.40        0.23        0.11       39.95    0.00
  0     200          0.00       9552.35        0.00        0.00        0.00    0.00
  0     400          0.00       1937.99       17.48       46.95       10.74    0.17
  0    
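
The `SPANS_SC_P`, `SPANS_SC_R`, and `SPANS_SC_F` columns report precision, recall, and F-score for spans stored under the `sc` key. The sketch below shows how these three numbers relate, assuming that a predicted span counts as correct only when its boundaries and label exactly match a gold span. The helper names are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def span_prf(gold: Set[Span], pred: Set[Span]):
    """Exact-match precision, recall, and F-score over labeled spans."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Reproduce the F-score logged at step 400 from its P and R columns.
print(round(f_score(46.95, 10.74), 2))  # 17.48
```

At step 400 the model is precise but misses most gold spans, which is why the F-score sits much closer to the low recall than to the precision.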

# Evaluating the model

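Once training is complete, run the following command to evaluate the final model. Run it from the project directory: the output below shows the error that spaCy reports when `project.yml` cannot be found in the current directory.
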
In [None]:
!python -m spacy project run evaluate_spancat


✘ Can't find project.yml
/content/project.yml



# Packaging the model as a Python package

In [None]:
!python -m spacy project run package


Running command: /usr/bin/python3 -m spacy package training/spancat/engagement_three/RoBERTa_subtree/model-best packages --name engagement_three_RoBERTa --version 1.10.0 --code ./scripts/custom_functions.py --force --build wheel
ℹ Building package artifacts: wheel
✔ Including 1 Python module(s) with custom code
✔ Including 2 package requirement(s) from meta and config
spacy-transformers>=1.1.7,<1.2.0, spacy-experimental>=0.6.1,<0.7.0
✔ Loaded meta.json from file
training/spancat/engagement_three/RoBERTa_subtree/model-best/meta.json
✔ Generated README.md from meta.json
✔ Successfully created package directory
'en_engagement_three_RoBERTa-1.10.0'
packages/en_engagement_three_RoBERTa-1.10.0
running bdist_wheel
running build
running build_py
Generating grammar tables from /usr/lib/python3.8/lib2to3/Grammar.txt
Generating grammar tables from /usr/lib/python3.8/lib2to3/PatternGrammar.txt
creating build
creatin
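
Once the build finishes, the wheel should be somewhere below the `packages` directory named in the command above. The sketch below locates it without assuming the exact subfolder layout; the helper name `find_wheels` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_wheels(packages_dir: str = "packages") -> List[Path]:
    """Return all wheel files below packages_dir, sorted by path."""
    root = Path(packages_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.whl"))


for wheel in find_wheels():
    # Each path can be passed to `pip install` to install the packaged pipeline.
    print(wheel)
```

After installing the wheel with `pip install`, the pipeline should be loadable with `spacy.load("en_engagement_three_RoBERTa")`, going by the package name in the output above. Predicted spans would then be expected under `doc.spans["sc"]`, assuming the packaged model was trained with the same spans key as in the training command shown earlier.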