# Pretraining a `tok2vec` model with a GPU accelerator

See https://github.com/explosion/projects/tree/master/ner-fashion-brands#-results for an explanation of the pretraining idea.

Prerequisites:
1. Enable the GPU accelerator in the notebook settings.
2. Check the CUDA version so that you can select the matching extra when installing the spaCy package.
3. Verify that CuPy is available.
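
A bare `import cupy` stops the notebook with a traceback when CuPy or a CUDA device is missing. The sketch below is one way to report the situation without raising. The helper name `cupy_status` is hypothetical and not part of CuPy.

```python
def cupy_status():
    """Return a one-line description of CuPy/GPU availability (hypothetical helper)."""
    try:
        import cupy
    except ImportError as exc:
        return f"CuPy is not importable: {exc}"
    try:
        # Asks the CUDA runtime how many devices are visible to this process.
        n_devices = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        return f"CuPy {cupy.__version__} is installed, but no usable CUDA device was found: {exc}"
    return f"CuPy {cupy.__version__} sees {n_devices} CUDA device(s)"


print(cupy_status())
```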

In [None]:
# CUDA toolkit version
!nvcc --version

In [None]:
# CuPy build and runtime configuration
import cupy
cupy.show_config()

**Install spaCy with GPU support**

https://spacy.io/usage#gpu
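
The extra name used in the install command encodes the CUDA version (`cuda101` corresponds to CUDA 10.1). As a minimal sketch, the release number printed by `nvcc --version` can be turned into that name. The helper `cuda_extra_from_nvcc` is hypothetical, the sample string only illustrates the usual shape of the `nvcc` release line, and whether spaCy ships an extra for a given CUDA version should be checked on the page linked above.

```python
import re


def cuda_extra_from_nvcc(nvcc_output):
    """Map `nvcc --version` output to an extra name such as 'cuda101' (hypothetical helper)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    major, minor = match.groups()
    return f"cuda{major}{minor}"


# Sample release line for illustration; in the notebook, pass the real nvcc output.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(cuda_extra_from_nvcc(sample))  # cuda101
```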

In [None]:
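# Match the extra to the CUDA version reported above; stay below v3 because the pretrain commands below use the spaCy v2 CLI
!pip install --upgrade --quiet "spacy[cuda101]<3"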

**Download spaCy models**

In [None]:
!pip install --quiet https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

**Define paths for data and models**

In [None]:
from pathlib import Path

DIR_DATA_INPUT = Path("../input/githubcord19/data/raw/")
FILE_RF_SENTENCES = DIR_DATA_INPUT / "cord_19_rf_sentences.jsonl"
FILE_ABSTRACTS_FILTERED = DIR_DATA_INPUT / "cord_19_abstracts_filtered.jsonl"
FILE_ABSTRACTS = DIR_DATA_INPUT / "cord_19_abstracts.jsonl"

DIR_MODELS = Path("/kaggle/working/models")
DIR_MODELS_RF_SENT = DIR_MODELS / "tok2vec_rf_sent_sci"
DIR_MODELS_ABS_FIL = DIR_MODELS / "tok2vec_abs_fil_sci"
DIR_MODELS_ABS = DIR_MODELS / "tok2vec_abs_sci"

**Delete old models**

In [None]:
!rm -rf $DIR_MODELS
!mkdir -p $DIR_MODELS

**Run pretraining**
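
Pretraining runs are long, so it may be worth checking the input files first. The spaCy v2 `pretrain` command reads a JSONL file in which each record carries raw text under the `"text"` key or pre-tokenized text under the `"tokens"` key. The sketch below checks the first lines of a file against that expectation. The function name `check_pretrain_jsonl` and the `max_lines` parameter are hypothetical.

```python
import json
from pathlib import Path


def check_pretrain_jsonl(path, max_lines=1000):
    """Check up to `max_lines` non-empty lines; return (n_checked, problems)."""
    problems = []
    n_checked = 0
    with Path(path).open(encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if n_checked >= max_lines:
                break
            line = line.strip()
            if not line:
                continue  # blank lines are skipped and not counted
            n_checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict) or not ("text" in record or "tokens" in record):
                problems.append((line_no, "missing 'text' or 'tokens' key"))
    return n_checked, problems
```

In this notebook it could be called as `check_pretrain_jsonl(FILE_ABSTRACTS)`; an empty `problems` list means the sampled lines look usable.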

In [None]:
# RF sentences (already run, result recorded below; uncomment to rerun)
#!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT --use-vectors

Lowest loss
```
  #      # Words   Total Loss     Loss    w/s
997     23041824   7843654.07     6039   44406
```
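
To compare runs without reading the console output by eye, the progress rows can be parsed into records. This is a minimal sketch that assumes the five whitespace-separated columns shown in the header above; the names `parse_pretrain_row` and `lowest_loss` are hypothetical.

```python
def parse_pretrain_row(line):
    """Parse one progress row with the columns: #, # Words, Total Loss, Loss, w/s."""
    epoch, words, total_loss, loss, words_per_second = line.split()
    return {
        "epoch": int(epoch),
        "words": int(words),
        "total_loss": float(total_loss),
        "loss": float(loss),
        "words_per_second": int(words_per_second),
    }


def lowest_loss(rows):
    """Return the parsed row with the smallest value in the Loss column."""
    return min(rows, key=lambda row: row["loss"])


# The row recorded above for the RF sentences run.
row = parse_pretrain_row("997     23041824   7843654.07     6039   44406")
print(row["epoch"], row["loss"])  # 997 6039.0
```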

In [None]:
# Filtered abstracts (already run, result recorded below; uncomment to rerun)
#!spacy pretrain $FILE_ABSTRACTS_FILTERED en_core_sci_lg $DIR_MODELS_ABS_FIL --use-vectors

Lowest loss
```
  #      # Words   Total Loss     Loss    w/s
975    130740080     45921767    36994   65769
```

In [None]:
# all abstracts
!spacy pretrain $FILE_ABSTRACTS en_core_sci_lg $DIR_MODELS_ABS --use-vectors

Lowest loss
```
  #      # Words   Total Loss     Loss    w/s

```