# Train a large ULMFiT + SentencePiece arXiv category classifier on arXiv abstracts

Compared to the 01-train-ulmfit-sp notebook, this one:
* uses a language model fine-tuned on arXiv abstracts (the model from 03-finetune-ulmfit-sp: 1 epoch, then 1 more epoch after unfreezing)
* uses a larger classifier head to avoid the 1200-50-176 bottleneck
* uses a 2x larger batch size
* trains in fp16
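
The fp16 item refers to half-precision training. Half precision has a narrow numeric range, so very small gradient values can underflow to zero; mixed-precision setups commonly multiply the loss by a scale factor before the backward pass and divide the gradients by the same factor afterwards. The sketch below illustrates the underflow with the standard library only. The gradient magnitude, the scale factor of 1024, and the helper name `to_fp16_value` are illustrative assumptions, not values taken from this notebook.

```python
import struct

def to_fp16_value(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8     # hypothetical gradient magnitude
loss_scale = 1024.0  # hypothetical scale factor

# Stored directly in fp16, the gradient underflows and is lost.
unscaled = to_fp16_value(tiny_grad)

# Scaled up first, it survives the fp16 round trip and can be
# divided by the same factor in higher precision afterwards.
scaled = to_fp16_value(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8, up to fp16 rounding error
```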

This notebook trains an arXiv category classifier using ULMFiT with a SentencePiece unigram tokenization model. Both the tokenizer and the language model were trained on a corpus of 64K+ machine learning papers. Here we train the classifier (after fine-tuning) on arXiv data, using only titles and abstracts to predict categories. Papers published before 2020 form the training set and papers published after 2020 form the validation set; the arXiv test set is excluded from both.
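
The date-based split described above can be sketched as follows. This is a minimal illustration, not the code that produced `data_clas_abs.pkl`: the record fields `id` and `published`, the function name `split_by_date`, and the exact cutoff of 1 January 2020 are assumptions made for the example.

```python
from datetime import date

def split_by_date(papers, test_ids, cutoff=date(2020, 1, 1)):
    """Return (train, valid) lists, dropping anything in the held-out test set."""
    train, valid = [], []
    for paper in papers:
        if paper["id"] in test_ids:
            continue  # never train or validate on test papers
        (train if paper["published"] < cutoff else valid).append(paper)
    return train, valid

# Hypothetical records; only the id and publication date matter here.
papers = [
    {"id": "a", "published": date(2018, 5, 1)},
    {"id": "b", "published": date(2019, 12, 31)},
    {"id": "c", "published": date(2020, 3, 15)},
    {"id": "d", "published": date(2020, 6, 1)},
]
train, valid = split_by_date(papers, test_ids={"b", "d"})
print([p["id"] for p in train], [p["id"] for p in valid])  # ['a'] ['c']
```

Splitting by time rather than at random keeps the validation score closer to how the classifier would be used on newly published papers.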

In [1]:
%cd ~/paperswithcode/paper-extractor

/home/ubuntu/paperswithcode/paper-extractor


In [2]:
import pandas as pd, numpy as np
from pathlib import Path

DATA_PATH = Path("notebooks/shared-notebooks/arxiv-class")
TRAIN_PATH = DATA_PATH / "arxiv-tag-classifier-data.json"
TEST_PATH = DATA_PATH / "classifier.tsv"

In [3]:
from fastai.text import *

BASE_DIR = Path("./models/ulmfit_baseline")
VOCAB_PATH = BASE_DIR / "data_lm_export_vocab.pkl"
MODELS_PATH = DATA_PATH / "models"

processor = SPProcessor(sp_model=BASE_DIR / "tmp" / "spm.model", sp_vocab=BASE_DIR / "tmp" / "spm.vocab", n_cpus=8, mark_fields=True)
vocab = Vocab.load(VOCAB_PATH)

In [4]:
data_clas = load_data(MODELS_PATH, "data_clas_abs.pkl", bs=128, num_workers=16)

In [5]:
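def set_seed(seed=None):
    """Seed PyTorch and NumPy and make cuDNN deterministic; no-op when seed is None."""
    if seed is not None:
        torch.manual_seed(seed)
        # Trade speed for reproducibility: fixed cuDNN algorithms, no autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        np.random.seed(seed)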

In [6]:
set_seed(42)
lin_ftrs = [len(data_clas.valid_dl.y.classes) * 2]  # 2 * 176 classes = 352 hidden units
learn = text_classifier_learner(data_clas, AWD_LSTM, lin_ftrs=lin_ftrs).to_fp16()
micro_f1 = MultiLabelFbeta(learn, beta=1.0)
learn.metrics = [micro_f1]
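
To get a feel for what the wider hidden layer changes, the sketch below counts the weights and biases of the two classifier heads: the 1200-50-176 layout mentioned at the top of this notebook and the 1200-352-176 layout that `lin_ftrs = [352]` is expected to produce. It only counts fully connected layers; any batch-norm or dropout layers in the real head are ignored, and the helper name `linear_params` is made up for this example.

```python
def linear_params(sizes):
    """Count weights and biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

n_classes = 176
default_head = [1200, 50, n_classes]            # the 1200-50-176 bottleneck
large_head = [1200, 2 * n_classes, n_classes]   # 1200-352-176

print(linear_params(default_head))  # 69026
print(linear_params(large_head))    # 484880
```

With a 50-unit hidden layer, all 176 category scores have to be computed from a 50-dimensional representation; the 352-unit layer removes that constraint at the cost of roughly seven times more head parameters.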

In [7]:
learn.load_encoder("arxiv_enc_sp30k_1_1_abstracts.pkl")

In [8]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,micro_fbeta,time
0,0.022476,0.021277,0.610026,17:22
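
The F-beta score reported after each epoch is micro-averaged with beta = 1, meaning true positives, false positives and false negatives are pooled over all papers and all categories before precision and recall are computed. The sketch below shows that computation in plain Python. It assumes the predictions have already been binarized to 0/1, so the thresholding of predicted probabilities that happens in practice is skipped; the function name and toy data are made up for the example.

```python
def micro_fbeta(y_true, y_pred, beta=1.0):
    """Micro-averaged F-beta over binary label matrices (lists of 0/1 rows)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Three papers, three categories: tp=3, fp=1, fn=2.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
print(micro_fbeta(y_true, y_pred))  # about 0.667
```

Because counts are pooled, frequent categories influence the micro-averaged score more than rare ones.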


In [9]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,micro_fbeta,time
0,0.020063,0.020329,0.637803,24:33


In [9]:
# old results
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(5e-3/2., 5e-3))

epoch,train_loss,valid_loss,micro_fbeta,time
0,0.019959,0.019981,0.640732,14:45
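
The training cells follow a gradual unfreezing schedule: first only the classifier head is trained, then `freeze_to(-2)` also releases the last encoder layer group, and finally `unfreeze()` releases everything. The sketch below models layer groups as list indices to show which groups are trainable at each stage. The count of five layer groups and the assumption that the learner starts out with only its last group trainable are not stated in this notebook; the helper `trainable_groups` is hypothetical.

```python
def trainable_groups(n_groups, freeze_to):
    """Indices of layer groups left trainable after freezing groups[:freeze_to]."""
    return list(range(n_groups))[freeze_to:]

n_groups = 5  # assumed number of layer groups in the classifier

# Each stage maps to the slice index passed to freeze_to;
# freeze() is treated as freeze_to(-1) and unfreeze() as freeze_to(0).
stages = {"freeze()": -1, "freeze_to(-2)": -2, "unfreeze()": 0}
for name, n in stages.items():
    print(name, trainable_groups(n_groups, n))
# freeze() [4]
# freeze_to(-2) [3, 4]
# unfreeze() [0, 1, 2, 3, 4]
```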


In [10]:
learn.unfreeze()
learn.fit_one_cycle(6, slice(2e-3/100, 2e-3))

epoch,train_loss,valid_loss,micro_fbeta,time
0,0.01874,0.020213,0.643172,41:41
1,0.017784,0.019404,0.653917,42:13
2,0.017556,0.018888,0.659366,40:29
3,0.017756,0.018624,0.664363,37:27
4,0.016774,0.018277,0.669639,41:15
5,0.016667,0.018294,0.670396,40:32
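
Passing `slice(2e-3/100, 2e-3)` asks for discriminative learning rates: the earliest layer group gets the smallest rate and the head gets the largest. The sketch below spreads the rates evenly on a log scale between the two ends, which is how fastai v1 is understood to interpret such a slice. The function `geometric_lrs` is a stand-alone re-implementation for illustration, and the count of five layer groups is an assumption.

```python
def geometric_lrs(start, stop, n):
    """n learning rates from start to stop, evenly spaced on a log scale."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

n_groups = 5  # assumed number of layer groups
lrs = geometric_lrs(2e-3 / 100, 2e-3, n_groups)
for i, lr in enumerate(lrs):
    print(f"group {i}: lr = {lr:.2e}")
```

Under these assumptions the rates come out at roughly 2e-5, 6.3e-5, 2e-4, 6.3e-4 and 2e-3, so each group trains about 3.2 times faster than the one before it.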


In [11]:
learn.save("arxiv_large_class_sp30k_1_1_ft_1_1_6_abstracts.pkl")
