# Pretraining a `tok2vec` model using a GPU accelerator

See https://github.com/explosion/projects/tree/master/ner-fashion-brands#-results for an explanation of the pretraining idea.

Prerequisites:
1. Enable the GPU accelerator in the notebook settings.
2. Check the CUDA version so that you can specify it later when installing spaCy.
3. Verify that CuPy is available.
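
Step 2 can be automated. The sketch below is a minimal example that turns the `release X.Y` string printed by `nvcc --version` into the name of a spaCy extra. The helper `spacy_cuda_extra` is hypothetical, and the `cuda<major><minor>` naming pattern is an assumption inferred from the `cuda101` extra used in this notebook; check the spaCy installation page for the extras that actually exist for your CUDA version.

```python
import re


def spacy_cuda_extra(nvcc_output: str) -> str:
    """Map `nvcc --version` output to a spaCy extra name such as 'cuda101'.

    Assumes extras follow the pattern cuda<major><minor>, which is an
    inference from the `spacy[cuda101]` install used in this notebook.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    major, minor = match.groups()
    return f"cuda{major}{minor}"


# Last line of the nvcc output from this notebook.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
extra = spacy_cuda_extra(sample)
print(extra)  # cuda101
```

The result can then be substituted into the install command, for example `spacy[cuda101]`.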

In [1]:
# CUDA
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


In [2]:
# CuPy
import cupy
cupy.show_config()

CuPy Version          : 7.3.0
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10010
CUDA Driver Version   : 10010
CUDA Runtime Version  : 10010
cuBLAS Version        : 10201
cuFFT Version         : 10101
cuRAND Version        : 10101
cuSOLVER Version      : (10, 2, 0)
cuSPARSE Version      : 10300
NVRTC Version         : (10, 1)
cuDNN Build Version   : 7605
cuDNN Version         : 7605
NCCL Build Version    : 2506
NCCL Runtime Version  : 2506


**Install spaCy with GPU support**

https://spacy.io/usage#gpu

In [3]:
!pip install --upgrade --quiet spacy[cuda101]

ERROR: allennlp 0.9.0 has requirement spacy<2.2,>=2.1.0, but you'll have spacy 2.2.4 which is incompatible.


**Download spaCy models**

In [4]:
!pip install --quiet https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

**Define paths for data and models**

In [5]:
from pathlib import Path

DIR_DATA_INPUT = Path("../input/githubcord19/data/raw/")
FILE_RF_SENTENCES = DIR_DATA_INPUT / "cord_19_rf_sentences.jsonl"
FILE_ABSTRACTS_FILTERED = DIR_DATA_INPUT / "cord_19_abstracts_filtered.jsonl"
FILE_ABSTRACTS = DIR_DATA_INPUT / "cord_19_abstracts.jsonl"

DIR_MODELS = Path("/kaggle/working/models")
DIR_MODELS_RF_SENT = DIR_MODELS / "tok2vec_rf_sent_sci"
DIR_MODELS_ABS_FIL = DIR_MODELS / "tok2vec_abs_fil_sci"
DIR_MODELS_ABS = DIR_MODELS / "tok2vec_abs_sci"
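
Because a pretraining run takes a long time, it may be worth checking the input files first. `spacy pretrain` reads a JSONL file in which each record provides the raw text under the key `"text"` or pre-tokenized text under the key `"tokens"`. The sketch below is a minimal check for that shape; the helper name `check_pretrain_jsonl` and the sample records are hypothetical, since the notebook does not show the contents of the CORD-19 files.

```python
import json


def check_pretrain_jsonl(lines):
    """Return (line_number, problem) pairs for records lacking usable text."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        has_text = isinstance(record, dict) and (
            "text" in record or "tokens" in record
        )
        if not has_text:
            problems.append((number, "missing 'text' or 'tokens' key"))
    return problems


# Made-up records for illustration only; not taken from the CORD-19 files.
sample = [
    '{"text": "Coronaviruses are enveloped RNA viruses."}',
    '{"abstract": "wrong key"}',
    "not json",
]
problems = check_pretrain_jsonl(sample)
print(problems)
```

On the real data, the function could be called with an open file handle, for example `check_pretrain_jsonl(FILE_RF_SENTENCES.open())`.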

**Delete old models**

In [6]:
!rm -rf $DIR_MODELS
!mkdir $DIR_MODELS
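
The same cleanup can be done in pure Python, which avoids relying on shell variable interpolation. This is a minimal sketch rather than part of the original notebook; `reset_dir` is a hypothetical helper, and the demonstration runs on a temporary directory so that no real model directory is touched.

```python
import shutil
import tempfile
from pathlib import Path


def reset_dir(path: Path) -> None:
    """Delete `path` with all its contents, then recreate it empty."""
    shutil.rmtree(path, ignore_errors=True)  # no error if it does not exist
    path.mkdir(parents=True, exist_ok=True)


# Demonstration on a throwaway directory containing one stale file.
demo_root = Path(tempfile.mkdtemp())
demo_models = demo_root / "models"
demo_models.mkdir()
(demo_models / "stale.bin").write_text("old weights")

reset_dir(demo_models)
remaining = list(demo_models.iterdir())
print(remaining)  # []

shutil.rmtree(demo_root)  # clean up the demonstration directory
```

In the notebook, the equivalent call would be `reset_dir(DIR_MODELS)`.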

**Run pretraining**

In [7]:
# RF sentences
!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT --use-vectors --n-iter 300

ℹ Using GPU
✔ Created output directory: /kaggle/working/models/tok2vec_rf_sent_sci
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'en_core_sci_lg'

  #      # Words   Total Loss     Loss    w/s
  0        23088   23294.6172    23294   3650
  1        46176   45944.8516    22650   40396
  2        69264   68028.0195    22083   46421
  3        92352   89181.5820    21153   45866
  4       115440   109133.309    19951   46487
  5       138528   127856.213    18722   47088
  6       161616   145422.207    17565   45828
  7       184704   162030.438    16608   39500
  8       207792   177947.607    15917   41484
  9       230880   193441.597    15493   42476
 10       253968   208577.663    15136   41277
 11       277056   223513.362    14935   46759
 12       300144   238279.274    14765   44603
 13       323232   252958.997    14679   45366
 14       346320   267552.200
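
In the rows shown above, the `Loss` column matches the increase in `Total Loss` since the previous row, truncated to an integer. The sketch below parses a few of those rows and checks that relationship. It is an observation drawn from this output, not a documented guarantee of `spacy pretrain`, and `parse_progress` is a hypothetical helper.

```python
def parse_progress(text):
    """Parse complete rows of the pretraining progress table into dicts."""
    records = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Data rows have exactly five numeric fields; the header and
        # truncated rows have a different field count and are skipped.
        if len(parts) != 5:
            continue
        epoch, words, total_loss, loss, wps = parts
        records.append({
            "epoch": int(epoch),
            "words": int(words),
            "total_loss": float(total_loss),
            "loss": int(loss),
            "wps": int(wps),
        })
    return records


# First rows of the RF sentences run, copied from the output above.
table = """
  #      # Words   Total Loss     Loss    w/s
  0        23088   23294.6172    23294   3650
  1        46176   45944.8516    22650   40396
  2        69264   68028.0195    22083   46421
"""

records = parse_progress(table)
for previous, current in zip(records, records[1:]):
    delta = int(current["total_loss"] - previous["total_loss"])
    print(current["epoch"], delta, current["loss"])
```

Reading the `loss` values per row this way makes it easier to plot the curve or to decide when the loss has stopped improving.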

In [8]:
# filtered abstracts
!spacy pretrain $FILE_ABSTRACTS_FILTERED en_core_sci_lg $DIR_MODELS_ABS_FIL --use-vectors --n-iter 300

ℹ Using GPU
✔ Created output directory: /kaggle/working/models/tok2vec_abs_fil_sci
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'en_core_sci_lg'

  #      # Words   Total Loss     Loss    w/s
  0       133955   134987.812   134987   45409
  1       267910   266985.125   131997   61204
  2       401865   394697.070   127711   60037
  3       535820   517320.586   122623   59594
  4       669775   632906.445   115585   58063
  5       803730   741264.844   108358   58781
  6       937685   842898.406   101633   63591
  7      1071640   938653.484    95755   63368
  8      1205595   1031030.38    92376   62073
  9      1339550   1121114.27    90083   61384
 10      1473505   1209763.61    88649   58774
 11      1607460   1297069.85    87306   61790
 12      1741415   1383733.43    86663   61348
 13      1875370   1469757.71    86024   60267
 14      2009325   1555457.9

In [9]:
# all abstracts
!spacy pretrain $FILE_ABSTRACTS en_core_sci_lg $DIR_MODELS_ABS --use-vectors --n-iter 300

ℹ Using GPU
✔ Created output directory: /kaggle/working/models/tok2vec_abs_sci
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'en_core_sci_lg'

  #      # Words   Total Loss     Loss    w/s
  0       242591   244346.219   244346   41559
  0       480209   478352.438   234006   49725
  0       715706   701571.656   223219   53037
  0       954634   918571.000   216999   53970
  0      1186111   1115243.00   196672   51872
  0      1430592   1306334.33   191091   54027
  0      1668892   1480930.36   174596   53345
  0      1901182   1643359.81   162429   52901
  0      2131623   1798876.42   155516   54358
  0      2371848   1957737.95   158861   54166
  0      2603850   2108804.58   151066   56472
  0      2842072   2262124.48   153319   53002
  0      3070133   2407879.59   145755   48071
  0      3307874   2558555.06   150675   49896
  0      3547717   2709980.62   1