# Pretraining a `tok2vec` model using a GPU accelerator

See https://github.com/explosion/projects/tree/master/ner-fashion-brands#-results for an explanation of the pretraining idea.

Prerequisites:
1. Enable the GPU accelerator in the notebook settings.
2. Check the CUDA version; you need it later to pick the matching extra when installing spaCy.
3. Verify that CuPy is available.

In [None]:
# CUDA
!nvcc --version

In [None]:
# CuPy
import cupy
cupy.show_config()
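
The CUDA release reported by `nvcc --version` determines which spaCy extra to install. The helper below is a minimal sketch of that mapping, assuming the `cudaXY` naming pattern seen in the `cuda101` extra used in this notebook. The function name `cuda_extra_from_nvcc` and the sample string are illustrative only, and not every CUDA release necessarily has a matching extra, so check the result against the spaCy installation page.

```python
import re


def cuda_extra_from_nvcc(nvcc_output: str) -> str:
    """Derive an extra name such as 'cuda101' from 'release 10.1' (hypothetical helper)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in the nvcc output")
    major, minor = match.groups()
    # Assumption: the extra is named 'cuda' + major + minor, as in 'cuda101'.
    return f"cuda{major}{minor}"


# Example line in the style of `nvcc --version` output (assumed sample text).
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(cuda_extra_from_nvcc(sample))
```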

**Install spaCy with GPU support**

https://spacy.io/usage#gpu

In [None]:
# cuda101 corresponds to CUDA 10.1; use the extra that matches the nvcc output above
!pip install --upgrade --quiet spacy[cuda101]
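
After installation it can be worth confirming that spaCy actually picks up the GPU before starting a long pretraining run. The sketch below relies on `spacy.prefer_gpu()`, which returns `True` when a GPU was activated and `False` otherwise. The wrapper name `gpu_status` is made up for this example.

```python
def gpu_status() -> str:
    """Report whether spaCy can use the GPU (hypothetical helper)."""
    try:
        import spacy
    except ImportError:
        return "spaCy is not installed"
    # prefer_gpu() activates the GPU if possible and returns True on success.
    if spacy.prefer_gpu():
        return "GPU available"
    return "GPU not available, using CPU"


print(gpu_status())
```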

**Download the scispaCy model**

In [4]:
!pip install --quiet https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

**Define paths for data and models**

In [1]:
from pathlib import Path

DIR_DATA_INPUT = Path("../input/githubcord19/data/raw/")
FILE_RF_SENTENCES = DIR_DATA_INPUT / "cord_19_rf_sentences.jsonl"
FILE_ABSTRACTS_FILTERED = DIR_DATA_INPUT / "cord_19_abstracts_filtered.jsonl"
FILE_ABSTRACTS = DIR_DATA_INPUT / "cord_19_abstracts.jsonl"

DIR_MODELS = Path("/kaggle/working/models")
DIR_MODELS_RF_SENT = DIR_MODELS / "tok2vec_rf_sent_sci"
DIR_MODELS_ABS_FIL = DIR_MODELS / "tok2vec_abs_fil_sci"
DIR_MODELS_ABS = DIR_MODELS / "tok2vec_abs_sci"
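
The input files are JSONL, and `spacy pretrain` reads one JSON object per line with either a `"text"` or a `"tokens"` key. The following is a rough sanity check under that assumption. The helper name `check_pretrain_jsonl` is hypothetical, and the actual schema of the CORD-19 files is not shown in this notebook.

```python
import json
from pathlib import Path


def check_pretrain_jsonl(path: Path, limit: int = 1000) -> dict:
    """Count usable and unusable records among the first `limit` lines."""
    stats = {"ok": 0, "bad": 0}
    with Path(path).open(encoding="utf8") as file_:
        for i, line in enumerate(file_):
            if i >= limit:
                break
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                stats["bad"] += 1  # not valid JSON (includes blank lines)
                continue
            # A record is usable if it carries raw text or pre-tokenized input.
            if isinstance(record, dict) and ("text" in record or "tokens" in record):
                stats["ok"] += 1
            else:
                stats["bad"] += 1
    return stats
```

In this notebook it could be called as `check_pretrain_jsonl(FILE_RF_SENTENCES)`; a non-zero `"bad"` count suggests inspecting the file before pretraining.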

**Delete old models**

In [9]:
!rm -rf $DIR_MODELS
!mkdir $DIR_MODELS
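
The same cleanup can be done in Python, which avoids shell variable expansion and does not fail when the directory is missing or already exists. This is only a sketch; `reset_dir` is a hypothetical helper, not part of the original notebook.

```python
import shutil
from pathlib import Path


def reset_dir(path: Path) -> Path:
    """Delete `path` with everything in it, then recreate it empty."""
    path = Path(path)
    shutil.rmtree(path, ignore_errors=True)  # like `rm -rf`: no error if missing
    path.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    return path
```

Here the call would be `reset_dir(DIR_MODELS)`.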

**Run pretraining**

In [6]:
# RF sentences
!spacy pretrain $FILE_RF_SENTENCES en_core_sci_lg $DIR_MODELS_RF_SENT --use-vectors

ℹ Using GPU
✔ Created output directory
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'en_core_sci_lg'
  #      # Words   Total Loss     Loss    w/s
  0        23088   23294.6172    23294   3468
  1        46176   45944.8516    22650   40843
  2        69264   68028.0547    22083   47017
  3        92352   89181.5859    21153   46686
  4       115440   109133.490    19951   46970
  5       138528   127856.344    18722   45596
  6       161616   145422.205    17565   46221
  7       184704   162030.652    16608   40386
  8       207792   177947.869    15917   43235
  9       230880   193441.955    15494   28720
 10       253968   208577.757    15135   45939
 11       277056   223513.554    14935   46610
 12       300144   238279.496    14765   43636
 13       323232   252958.950    14679   42354
 14       346320   267552.241    14593   46073
 15       369408   282073.733    14521   47408
 

168      3901872   1969394.93     9388   49356
169      3924960   1978814.65     9419   44854
170      3948048   1988163.50     9348   46885
171      3971136   1997551.59     9388   48213
172      3994224   2006883.23     9331   48917
173      4017312   2016172.83     9289   48977
174      4040400   2025513.22     9340   48311
175      4063488   2034780.08     9266   44279
176      4086576   2044064.79     9284   48997
177      4109664   2053311.60     9246   49309
178      4132752   2062577.75     9266   48660
179      4155840   2071809.96     9232   48116
180      4178928   2081009.29     9199   48951
181      4202016   2090233.26     9223   41146
182      4225104   2099432.70     9199   48676
183      4248192   2108624.29     9191   48561
184      4271280   2117749.34     9125   45883
185      4294368   2126862.96     9113   47877
186      4317456   2135995.44     9132   45338
187      4340544   2145134.68     9139   43714
188      4363632   2154253.03     9118   49279
189      4386

343      7942272   3451428.46     7793   42512
344      7965360   3459251.53     7823   47588
345      7988448   3467035.86     7784   48659
346      8011536   3474769.32     7733   48523
347      8034624   3482529.10     7759   47392
348      8057712   3490228.64     7699   46749
349      8080800   3497970.87     7742   43172
350      8103888   3505771.31     7800   47484
351      8126976   3513527.48     7756   46514
352      8150064   3521229.83     7702   46479
353      8173152   3528930.55     7700   46948
354      8196240   3536664.41     7733   42916
355      8219328   3544329.24     7664   44042
356      8242416   3552102.90     7773   47570
357      8265504   3559786.43     7683   33326
358      8288592   3567529.57     7743   38339
359      8311680   3575219.83     7690   45802
360      8334768   3582978.40     7758   44056
361      8357856   3590683.90     7705   42252
362      8380944   3598327.61     7643   44639
363      8404032   3605968.68     7641   36366
364      8427

518     11982672   4739614.86     7025   48958
519     12005760   4746638.32     7023   49028
520     12028848   4753661.00     7022   49368
521     12051936   4760724.91     7063   48870
522     12075024   4767739.48     7014   49676
523     12098112   4774749.49     7010   44936
524     12121200   4781717.66     6968   47863
525     12144288   4788744.98     7027   48382
526     12167376   4795779.94     7034   47112
527     12190464   4802760.08     6980   49062
528     12213552   4809691.60     6931   49259
529     12236640   4816697.52     7005   44780
530     12259728   4823643.02     6945   49050
531     12282816   4830625.61     6982   49117
532     12305904   4837603.64     6978   47505
533     12328992   4844534.45     6930   48975
534     12352080   4851454.96     6920   49006
535     12375168   4858390.45     6935   44835
536     12398256   4865380.11     6989   43770
537     12421344   4872357.36     6977   42177
538     12444432   4879282.70     6925   48272
539     12467

693     16023072   5926104.66     6523   44794
694     16046160   5932602.87     6498   43580
695     16069248   5939210.39     6607   43391
696     16092336   5945791.25     6580   46010
697     16115424   5952398.76     6607   42426
698     16138512   5958954.63     6555   47521
699     16161600   5965524.27     6569   46392
700     16184688   5972090.60     6566   47852
701     16207776   5978680.26     6589   48051
702     16230864   5985237.95     6557   48143
703     16253952   5991779.73     6541   45013
704     16277040   5998318.51     6538   48930
705     16300128   6004875.29     6556   49386
706     16323216   6011406.00     6530   48993
707     16346304   6017970.06     6564   46745
708     16369392   6024473.97     6503   48469
709     16392480   6031044.81     6570   44615
710     16415568   6037540.85     6496   49259
711     16438656   6044031.86     6491   48384
712     16461744   6050612.40     6580   49179
713     16484832   6057089.17     6476   48784
714     16507

868     20063472   7047779.59     6265   44628
869     20086560   7054018.31     6238   49257
870     20109648   7060226.34     6208   47026
871     20132736   7066449.73     6223   44831
872     20155824   7072695.04     6245   48718
873     20178912   7078939.73     6244   49253
874     20202000   7085232.56     6292   49442
875     20225088   7091447.15     6214   49092
876     20248176   7097776.26     6329   48885
877     20271264   7104051.86     6275   45174
878     20294352   7110315.26     6263   48176
879     20317440   7116515.71     6200   48708
880     20340528   7122772.11     6256   49467
881     20363616   7129085.58     6313   44936
882     20386704   7135270.74     6185   49290
883     20409792   7141450.01     6179   44943
884     20432880   7147675.06     6225   49215
885     20455968   7153915.79     6240   48998
886     20479056   7160161.84     6246   49250
887     20502144   7166433.96     6272   48870
888     20525232   7172675.91     6241   49479
889     20548

Lowest loss
```
  #      # Words   Total Loss     Loss    w/s
997     23041824   7843654.07     6039   44406
```
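
Rather than scanning the log by eye, the row with the lowest per-epoch loss can be picked out programmatically. The sketch below parses the five-column table printed by `spacy pretrain` from captured text. The function names and the idea of capturing the cell output as a string are assumptions made for illustration; the three sample rows are the first data rows of the log above.

```python
def parse_pretrain_rows(log_text: str) -> list:
    """Parse 'epoch  words  total_loss  loss  w/s' rows from pretrain output."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip the header, blank lines and truncated rows
        try:
            row = {
                "epoch": int(parts[0]),
                "words": int(parts[1]),
                "total_loss": float(parts[2]),
                "loss": float(parts[3]),
                "wps": int(parts[4]),
            }
        except ValueError:
            continue  # skip status lines that happen to have five tokens
        rows.append(row)
    return rows


def lowest_loss_row(rows: list) -> dict:
    """Return the row with the smallest per-epoch loss (first one on ties)."""
    return min(rows, key=lambda row: row["loss"])


sample_log = """
  #      # Words   Total Loss     Loss    w/s
  0        23088   23294.6172    23294   3468
  1        46176   45944.8516    22650   40843
  2        69264   68028.0547    22083   47017
"""
best = lowest_loss_row(parse_pretrain_rows(sample_log))
print(best["epoch"], best["loss"])
```

The epoch with the lowest loss is not necessarily the last one, which is why the row is selected by value rather than by position.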

In [8]:
# filtered abstracts
!spacy pretrain $FILE_ABSTRACTS_FILTERED en_core_sci_lg $DIR_MODELS_ABS_FIL --use-vectors

ℹ Using GPU
✔ Created output directory
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'en_core_sci_lg'
  #      # Words   Total Loss     Loss    w/s
  0       133955   134987.812   134987   48975
  1       267910   266985.109   131997   65542
  2       401865   394696.984   127711   62524
  3       535820   517320.656   122623   64674
  4       669775   632906.188   115585   65846
  5       803730   741261.641   108355   65614
  6       937685   842894.250   101632   65237
  7      1071640   938650.766    95756   65031
  8      1205595   1031027.55    92376   57122
  9      1339550   1121111.80    90084   65556
 10      1473505   1209762.47    88650   59214
 11      1607460   1297069.90    87307   64010
 12      1741415   1383734.96    86665   62018
 13      1875370   1469760.93    86025   59915
 14      2009325   1555463.35    85702   65083
 15      2143280   1640813.27    85349   65755


168     22638395     11691502    55496   65757
169     22772350     11746514    55012   65242
170     22906305     11801889    55375   66017
171     23040260     11856917    55027   64693
172     23174215     11911856    54939   64253
173     23308170     11966575    54718   66013
174     23442125     12021134    54558   65718
175     23576080     12075701    54567   64575
176     23710035     12130254    54552   64243
177     23843990     12184766    54512   60629
178     23977945     12239173    54407   65126
179     24111900     12293844    54670   64855
180     24245855     12348670    54825   64116
181     24379810     12402792    54121   65049
182     24513765     12456690    53897   62960
183     24647720     12510645    53955   57786
184     24781675     12564573    53928   64743
185     24915630     12618370    53797   63549
186     25049585     12672251    53880   64685
187     25183540     12725824    53572   63364
188     25317495     12779474    53650   65433
189     25451

343     46080520     20376984    45621   65763
344     46214475     20422683    45698   63750
345     46348430     20468536    45852   64159
346     46482385     20513992    45456   65675
347     46616340     20559671    45679   65624
348     46750295     20605100    45428   65303
349     46884250     20650945    45845   65069
350     47018205     20696363    45417   63834
351     47152160     20741984    45620   65306
352     47286115     20787543    45559   64109
353     47420070     20832793    45250   62789
354     47554025     20878110    45316   63356
355     47687980     20923234    45123   58242
356     47821935     20968417    45183   62124
357     47955890     21013626    45208   64337
358     48089845     21058764    45138   65214
359     48223800     21104025    45260   64745
360     48357755     21149072    45047   64623
361     48491710     21194106    45034   65572
362     48625665     21239186    45079   64954
363     48759620     21284345    45159   65463
364     48893

518     69522645     27990416    41739   66475
519     69656600     28032244    41827   64894
520     69790555     28074044    41800   66273
521     69924510     28115643    41599   64297
522     70058465     28157479    41836   65785
523     70192420     28199169    41689   66722
524     70326375     28240743    41574   65977
525     70460330     28282438    41694   64201
526     70594285     28323934    41496   59960
527     70728240     28365957    42022   64766
528     70862195     28407632    41675   64321
529     70996150     28449283    41650   64554
530     71130105     28490858    41574   61209
531     71264060     28532305    41447   65714
532     71398015     28573792    41486   66084
533     71531970     28615663    41870   66168
534     71665925     28656968    41305   66004
535     71799880     28698423    41454   63627
536     71933835     28739823    41399   64398
537     72067790     28781211    41388   66123
538     72201745     28822555    41343   63736
539     72335

693     92964770     35098201    39729   66722
694     93098725     35137673    39472   66996
695     93232680     35177126    39453   65662
696     93366635     35216840    39713   66823
697     93500590     35256564    39724   66840
698     93634545     35296163    39598   65083
699     93768500     35335922    39758   66517
700     93902455     35375205    39283   60419
701     94036410     35414608    39402   65011
702     94170365     35454143    39534   65250
703     94304320     35493829    39686   55622
704     94438275     35533138    39308   66476
705     94572230     35572743    39605   65004
706     94706185     35612034    39290   64417
707     94840140     35651231    39196   66911
708     94974095     35690507    39275   65192
709     95108050     35729752    39245   66700
710     95242005     35769017    39264   65998
711     95375960     35808414    39397   65195
712     95509915     35847904    39489   66433
713     95643870     35887298    39393   66466
714     95777

868    116406895     41885951    38187   64590
869    116540850     41923830    37878   67406
870    116674805     41961748    37918   64390
871    116808760     41999868    38120   65046
872    116942715     42037992    38123   65246
873    117076670     42076145    38152   64756
874    117210625     42114237    38091   67370
875    117344580     42152077    37839   67400
876    117478535     42190266    38189   65924
877    117612490     42228352    38085   65977
878    117746445     42266194    37842   65669
879    117880400     42304202    38007   57973
880    118014355     42342030    37828   64447
881    118148310     42380069    38038   65426
882    118282265     42418216    38146   67611
883    118416220     42456174    37958   64596
884    118550175     42494418    38243   66113
885    118684130     42532329    37911   66558
886    118818085     42569938    37608   64476
887    118952040     42607904    37966   65990
888    119085995     42645754    37849   66670
889    119219

Lowest loss
```
  #      # Words   Total Loss     Loss    w/s
975    130740080     45921767    36994   65769
```
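
The `# Words` and `w/s` columns allow a rough estimate of how long a run takes. The sketch below divides the words per epoch by the throughput; the helper name `estimate_seconds` is hypothetical, the 65,000 w/s figure is an approximate reading of the log above, and the 1,000-epoch count is an assumption. Per-epoch overhead such as saving model files is ignored.

```python
def estimate_seconds(words_per_epoch: int, words_per_second: float, epochs: int = 1) -> float:
    """Estimate wall-clock seconds from corpus size and throughput."""
    if words_per_second <= 0:
        raise ValueError("words_per_second must be positive")
    return epochs * words_per_epoch / words_per_second


# Filtered abstracts: 133,955 words per epoch (row 0) at roughly 65,000 w/s.
per_epoch = estimate_seconds(133955, 65000)
total_minutes = estimate_seconds(133955, 65000, epochs=1000) / 60  # assumed epoch count
print(f"{per_epoch:.2f} s per epoch, about {total_minutes:.0f} min in total")
```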

In [None]:
# all abstracts
!spacy pretrain $FILE_ABSTRACTS en_core_sci_lg $DIR_MODELS_ABS --use-vectors

ℹ Using GPU
✔ Created output directory
✔ Saved settings to config.json
✔ Loaded input texts
✔ Loaded model 'en_core_sci_lg'
  #      # Words   Total Loss     Loss    w/s
  0       242591   244346.219   244346   47645
  0       480209   478352.438   234006   51776
  0       715706   701571.688   223219   55120
  0       954634   918571.062   216999   57265
  0      1186111   1115242.88   196671   55480
  0      1430592   1306334.20   191091   54459
  0      1668892   1480930.92   174596   55226
  0      1901182   1643360.61   162429   57284
  0      2131623   1798877.50   155516   58521
  0      2371848   1957739.00   158861   59465
  0      2603850   2108805.12   151066   57163
  0      2842072   2262124.73   153319   59317
  0      3070133   2407879.92   145755   58520
  0      3307874   2558555.47   150675   56470
  0      3547717   2709981.28   151425   59093
  0      3790030   2862529.33   152548   55095


  8     39115344     20555674   101194   63124
  8     39348004     20655737   100062   64090
  8     39590702     20759994   104257   60854
  8     39819309     20857712    97717   63665
  8     40058694     20959924   102212   63409
  8     40299024     21062091   102166   61973
  8     40531598     21161356    99265   62957
  8     40771783     21262945   101589   63379
  8     41009141     21363552   100606   61577
  8     41245496     21464488   100935   63251
  8     41484384     21565711   101223   59462
  8     41609520     21618508    52796   61135
  9     41852287     21720443   101935   60702
  9     42086205     21819370    98927   62863
  9     42322240     21918922    99551   62384
  9     42557936     22018003    99081   63606
  9     42789817     22115057    97053   64154
  9     43035948     22218153   103096   62092
  9     43275017     22318047    99893   64576
  9     43502913     22413626    95579   63585
  9     43743478     22514073   100447   63324
  9     43990

Lowest loss
```
  #      # Words   Total Loss     Loss    w/s

```