## Repeated cross-validation for the broad category model

This notebook drives the main pipeline for training and testing models with repeated cross-validation. It:
* Reads the annotated dataset
* Selects only the largest broad categories that have enough data for training
* Transforms the data into the format required for training spaCy models
* Splits the data into chunks according to the cross-validation fold settings
* Generates the JSONL input files
* Executes the pipeline by training and testing across all chunks of data

The main pipeline lives in the folder 'textcat_broad_categories' and is set up to accept the following input files:
* assays_train.jsonl
* assays_test.jsonl

After running on the input data, it generates the following outputs:
* model-best
* model-last
* metrics.json

During cross-validation, the input files of each chunk are copied into the pipeline folder. The pipeline runs on those files, and the outputs it generates are copied back to that chunk's folder.

To test a specific model live, move that model's output files into the pipeline's output folder and run the pipeline commands.
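
Once cross-validation has finished, each chunk folder holds its own copy of `metrics.json`, so scores can be summarized across chunks. The following is a minimal sketch, assuming the outputs land in `Model_cv/chunkN/training/metrics.json` and that each file is a JSON object whose top-level numeric entries are scores. The helper name `summarize_metrics` is hypothetical, and the demo writes placeholder numbers to a temporary folder rather than reading real results:

```python
import json
import statistics
import tempfile
from pathlib import Path


def summarize_metrics(cv_dir):
    """Average every top-level numeric value found in */training/metrics.json."""
    collected = {}
    for metrics_file in sorted(Path(cv_dir).glob('*/training/metrics.json')):
        with open(metrics_file) as f:
            metrics = json.load(f)
        for key, value in metrics.items():
            # Skip nested per-label dictionaries and booleans; keep plain numbers
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                collected.setdefault(key, []).append(value)
    return {key: statistics.mean(values) for key, values in collected.items()}


# Demo with placeholder values (not real results); 'score' is an arbitrary key
with tempfile.TemporaryDirectory() as tmp:
    for chunk, score in [('chunk0', 0.5), ('chunk1', 1.0)]:
        out_dir = Path(tmp) / chunk / 'training'
        out_dir.mkdir(parents=True)
        (out_dir / 'metrics.json').write_text(json.dumps({'score': score}))
    summary = summarize_metrics(tmp)

print(summary)  # {'score': 0.75}
```

In a real run the call would be `summarize_metrics('Model_cv')`, provided the folder layout matches the assumption above.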

### Importing modules

In [1]:
import pandas as pd
from pathlib import Path
import os
import json
import spacy
from sklearn.model_selection import RepeatedKFold
import shutil
import subprocess

### Setup

In [2]:
# Set path to python in the environment to use for model training
env_path = "/hps/software/users/chembl/ines/assays_description/bin/python"

In [3]:
# Display settings (optional)
pd.set_option('display.max_colwidth', None)  # Set to None to display the full column width
pd.set_option('display.max_rows', None)      # Set to None to display all rows

### Model training: Binding assays and functional assays

##### Main annotated dataset

In [4]:
dataset = pd.read_csv('data/2_broad_category_training_data.csv')

In [5]:
dataset.head()

Unnamed: 0,assay_id,assay_type,description,label,bao_preferred_term,bao_id
0,868,B,"Inhibition of [3H]8-hydroxy-2-dipropylamino-1,2,3,4-tetrahydronaphthalene binding to 5-hydroxytryptamine 1A receptor in hippocampus region of rat brain; Residual radioligand binding higher than 50%","Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
1,2027,B,Displacement of [3H]-5-HT from human 5-hydroxytryptamine 1D receptor beta,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
2,2430,B,Inhibition constant for in vitro inhibition of [3H]ketanserin binding to rat frontal cortex membranes 5-hydroxytryptamine 2A receptor,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
3,3306,B,Compound was evaluated for the binding affinity against human cloned 5-hydroxytryptamine 4 receptor in HeLa cells using [3H]-LSD as the radioligand,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
4,3703,B,In vitro binding affinity by radioligand binding assay using cell line expressing human 5-hydroxytryptamine 7 receptor; ND means not determined,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776


In [6]:
# Check annotated broad categories
dataset['label'].value_counts()

label
Nucleic acid binding                              176
Protein activity                                  171
Binding affinity, displacement, competition       134
Radioligand competition, displacement, binding    118
Cell phenotype                                    113
Antimicrobial activity                             88
in vivo method                                     63
Name: count, dtype: int64

In [7]:
len(dataset)

863

In [8]:
# Can only train the large categories
chosen_categories = [
    'Nucleic acid binding'
    , 'Protein activity'
    , 'Binding affinity, displacement, competition'
    , 'Radioligand competition, displacement, binding'
    , 'Cell phenotype'
    , 'Antimicrobial activity'
    , 'in vivo method'
]

In [9]:
processed_df = dataset.loc[dataset['label'].isin(chosen_categories)]

In [10]:
len(dataset), len(processed_df)

(863, 863)

In [11]:
dataset_counts = pd.DataFrame(data=processed_df['label'].value_counts().reset_index())
dataset_counts.rename(columns={'count': 'assay_count'}, inplace=True)
dataset_counts

Unnamed: 0,label,assay_count
0,Nucleic acid binding,176
1,Protein activity,171
2,"Binding affinity, displacement, competition",134
3,"Radioligand competition, displacement, binding",118
4,Cell phenotype,113
5,Antimicrobial activity,88
6,in vivo method,63


In [12]:
labels_ids = processed_df[['label', 'bao_preferred_term', 'bao_id']].drop_duplicates()
labels_ids

Unnamed: 0,label,bao_preferred_term,bao_id
0,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
6,Protein activity,functional target-based,BAO_0013016
12,"Binding affinity, displacement, competition",binding assay,BAO_0002989
16,in vivo method,in vivo assay method,BAO_0040021
22,Cell phenotype,cell phenotype,BAO_0002542
28,Nucleic acid binding,,No suitable ID at present
41,Antimicrobial activity,,No suitable ID at present


In [13]:
dataset_counts_merged = dataset_counts.merge(labels_ids, on = 'label')
dataset_counts_merged

Unnamed: 0,label,assay_count,bao_preferred_term,bao_id
0,Nucleic acid binding,176,,No suitable ID at present
1,Protein activity,171,functional target-based,BAO_0013016
2,"Binding affinity, displacement, competition",134,binding assay,BAO_0002989
3,"Radioligand competition, displacement, binding",118,radioligand binding assay,BAO_0002776
4,Cell phenotype,113,cell phenotype,BAO_0002542
5,Antimicrobial activity,88,,No suitable ID at present
6,in vivo method,63,in vivo assay method,BAO_0040021


##### Process input data to the format for training in spaCy

In [14]:
value_dict = {
    'Radioligand competition, displacement, binding': {'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Binding affinity, displacement, competition': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 1.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Protein activity': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 1.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'in vivo method': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 1.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Cell phenotype': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 1.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Nucleic acid binding': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 1.0, 'Antimicrobial activity': 0.0}
    , 'Antimicrobial activity': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 1.0}

}
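
The same mapping can also be generated from a single label-to-category lookup instead of being written out by hand, which reduces the risk of copy-paste mistakes when a category is added. A minimal sketch follows; `LABEL_TO_CAT` and `build_value_dict` are hypothetical names, not part of the notebook:

```python
# Hypothetical lookup: dataset label -> category name used in value_dict above
LABEL_TO_CAT = {
    'Radioligand competition, displacement, binding': 'Radioligand binding (BAO_0002776)',
    'Binding affinity, displacement, competition': 'Binding (BAO_0002989)',
    'Protein activity': 'Protein activity (BAO_0013016)',
    'in vivo method': 'in vivo method (BAO_0040021)',
    'Cell phenotype': 'Cell phenotype (BAO_0002542)',
    'Nucleic acid binding': 'Nucleic acid binding',
    'Antimicrobial activity': 'Antimicrobial activity',
}


def build_value_dict(label_to_cat):
    """Return {label: {category: 1.0 if it matches the label else 0.0}}."""
    cats = list(label_to_cat.values())
    return {
        label: {cat: float(cat == own_cat) for cat in cats}
        for label, own_cat in label_to_cat.items()
    }


generated = build_value_dict(LABEL_TO_CAT)
print(generated['Cell phenotype']['Cell phenotype (BAO_0002542)'])  # 1.0
```

Each inner dictionary has exactly one entry set to 1.0, matching the one-hot layout of the hand-written version.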

In [15]:
processed_df = processed_df.assign(
    cats=processed_df['label'].apply(lambda x: value_dict[x])
)

In [16]:
processed_df = processed_df.rename(columns={'description': 'text'})

In [17]:
processed_df = processed_df[['text', 'cats']]

In [18]:
processed_df

Unnamed: 0,text,cats
0,"Inhibition of [3H]8-hydroxy-2-dipropylamino-1,2,3,4-tetrahydronaphthalene binding to 5-hydroxytryptamine 1A receptor in hippocampus region of rat brain; Residual radioligand binding higher than 50%","{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
1,Displacement of [3H]-5-HT from human 5-hydroxytryptamine 1D receptor beta,"{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
2,Inhibition constant for in vitro inhibition of [3H]ketanserin binding to rat frontal cortex membranes 5-hydroxytryptamine 2A receptor,"{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
3,Compound was evaluated for the binding affinity against human cloned 5-hydroxytryptamine 4 receptor in HeLa cells using [3H]-LSD as the radioligand,"{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
4,In vitro binding affinity by radioligand binding assay using cell line expressing human 5-hydroxytryptamine 7 receptor; ND means not determined,"{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
6,Antagonism of NECA-induced stimulation of adenylate cyclase activity in human platelet membranes at A2-adenosine receptor; value ranges from 21-30,"{'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 1.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
7,Inhibitory activity was calculated for the model Acetylcholinesterase (Expt-1),"{'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 1.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
8,Percent remaining of radioligand [3H]DPCPX binding to human adenosine A1 receptor at 10 uM,"{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
9,Concentration required for 50% inhibition of [3H]NECA binding on rat brain adenosine A2 receptor,"{'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"
10,In vitro concentration required for 50% inhibition against Adenosine Kinase (AK) in the presence of intact cells,"{'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 1.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}"


##### Setting up the Cross-Validation

In [19]:
mpath = "Model_cv"
dirpath = Path(mpath)
if dirpath.exists():
    shutil.rmtree(dirpath)
os.makedirs(mpath)

# Generate the cross-validation chunks and write them as JSONL files
def save_CV_folds_json(mpath, dataset):

    # 5 folds x 5 repeats = 25 train/test chunks; fixed seed for reproducibility
    rkf = RepeatedKFold(n_splits=5, n_repeats=5, random_state=573927565)
    for fold, (train_index, test_index) in enumerate(rkf.split(dataset)):
        train_df = dataset.iloc[train_index]
        test_df = dataset.iloc[test_index]

        path = os.path.join(mpath,'chunk{}'.format(fold))
        os.makedirs(path, exist_ok=True)
        
        # Write to JSONL files
        with open(os.path.join(path,'assays_train.jsonl'), 'w') as f:
            f.write(train_df.to_json(orient='records', lines=True))
        with open(os.path.join(path,'assays_test.jsonl'), 'w') as f:
            f.write(test_df.to_json(orient='records', lines=True))

In [20]:
save_CV_folds_json(mpath, processed_df)
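
With `n_splits=5` and `n_repeats=5`, the call above writes 25 chunk folders (`chunk0` to `chunk24`). For the 863 rows in this dataset, each test file holds 172 or 173 records and the matching train file holds the rest. The sketch below reproduces that arithmetic without scikit-learn, assuming the usual K-fold convention that the first `n_samples % n_splits` folds receive one extra sample; `expected_chunk_sizes` is a hypothetical helper name:

```python
def expected_chunk_sizes(n_samples, n_splits=5, n_repeats=5):
    """Return a list of (train_size, test_size) tuples, one per chunk."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds of every repeat take one additional test sample
    test_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    one_repeat = [(n_samples - t, t) for t in test_sizes]
    return one_repeat * n_repeats


sizes = expected_chunk_sizes(863)
print(len(sizes))   # 25
print(sizes[0])     # (690, 173)
print(sizes[4])     # (691, 172)
```

Shuffling changes which rows land in each fold but not the fold sizes, so the counts hold for every repeat.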

##### Executing Cross-Validation: Move inputs to pipeline, run pipeline, take outputs

In [21]:
for f in Path(mpath).iterdir():

    chunk = f.name
    print(chunk)
    # These paths are relative to the pipeline folder entered below
    train = os.path.join('..', f, 'assays_train.jsonl')
    test = os.path.join('..', f, 'assays_test.jsonl')

    # Chdir to the pipeline template folder
    os.chdir('textcat_broad_categories')

    # exist_ok avoids a crash if a previous run was interrupted before cleanup
    os.makedirs('assets', exist_ok=True)
    os.makedirs('training', exist_ok=True)

    # Copy the current input files to the pipeline path
    shutil.copy(train, 'assets')
    shutil.copy(test, 'assets')

    # Run the pipeline with the Python interpreter of the training environment
    subprocess.run([env_path, '-m', 'weasel', 'run', 'all'])

    # Copy outputs to the chunk folder
    opath = os.path.join('..', f, 'training')
    shutil.copytree('training', opath, dirs_exist_ok=True)

    # Remove directories to start clean
    shutil.rmtree('assets')
    shutil.rmtree('training')
    shutil.rmtree('corpus')
    os.remove('project.lock')

    os.chdir('..')


chunk0


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.63           0.36       55.79    0.56


  1     200         91.61          60.29       61.17    0.61


  3     400         45.76          28.94       79.86    0.80


  5     600         30.53          20.15       90.38    0.90


  7     800         30.87          14.99       94.97    0.95


 10    1000         35.64          12.52       96.48    0.96


 14    1200         39.75           9.75       96.93    0.97


 19    1400         40.30           7.78       97.53    0.98


 25    1600         44.85           6.10       97.71    0.98


 32    1800         51.05           5.01       97.86    0.98


 41    2000         49.02           3.55       97.92    0.98


 52    2200         52.00           2.71       98.08    0.98


 65    2400         50.27           1.99       98.16    0.98


 79    2600         34.93           1.34       98.08    0.98


 92    2800         31.33           1.02       98.16    0.98


105    3000         23.59           0.77       98.04    0.98


119    3200         18.46           0.57       98.14    0.98


132    3400         16.52           0.40       98.09    0.98


145    3600         13.92           0.33       98.13    0.98


159    3800         10.02           0.24       98.16    0.98


172    4000          8.29           0.18       98.09    0.98


185    4200          6.08           0.16       98.14    0.98


199    4400          6.61           0.16       98.07    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.16
SPEED                 28113

                                         P        R        F
Radioligand binding (BAO_0002776)    88.46   100.00    93.88
Binding (BAO_0002989)                96.00    96.00    96.00
Protein activity (BAO_0013016)       85.19    79.31    82.14
in vivo method (BAO_0040021)        100.00    76.92    86.96
Cell phenotype (BAO_0002542)         96.15    89.29    92.59
Nucleic acid binding                 95.35    97.62    96.47
Antimicrobial activity               91.67    84.62    88.00

                                   ROC AUC
Radioligand binding (BAO_0002776)     1.00
Binding (BAO_0002989)                 0.99
Protein activity (BAO_0013016)        0.97
in vivo method (BAO_0040021)          0.97
Cell phenotype (BAO_0002542)          0.96
Nucleic acid binding                  1.00
Antimicrobial activity                0.99

✔ Saved results to training/metrics.json
ℹ Running work

chunk1


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.42           0.31       51.32    0.51


  1     200         87.29          59.07       59.37    0.59


  2     400         45.33          28.71       81.22    0.81


  5     600         31.13          19.39       87.54    0.88


  7     800         32.16          14.72       90.37    0.90


 10    1000         34.87          11.40       92.48    0.92


 14    1200         33.88           8.62       93.69    0.94


 19    1400         35.59           6.63       94.60    0.95


 25    1600         35.78           5.30       95.75    0.96


 32    1800         40.83           4.43       96.28    0.96


 41    2000         37.53           3.25       96.75    0.97


 52    2200         38.01           2.40       97.01    0.97


 65    2400         38.89           1.84       97.36    0.97


 78    2600         27.07           1.30       97.64    0.98


 91    2800         27.38           1.08       97.79    0.98


105    3000         19.68           0.77       97.66    0.98


118    3200         16.45           0.60       97.64    0.98


131    3400         12.09           0.43       97.70    0.98


145    3600          8.95           0.36       97.66    0.98


158    3800          6.55           0.27       97.68    0.98


171    4000          5.69           0.22       97.68    0.98


185    4200          6.27           0.19       97.58    0.98


198    4400          6.58           0.16       97.68    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.79
SPEED                 10952

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    72.73    84.21
Binding (BAO_0002989)                81.25    89.66    85.25
Protein activity (BAO_0013016)       69.70    76.67    73.02
in vivo method (BAO_0040021)        100.00    72.73    84.21
Cell phenotype (BAO_0002542)         87.50    70.00    77.78
Nucleic acid binding                 96.67    93.55    95.08
Antimicrobial activity              100.00    80.00    88.89

                                   ROC AUC
Radioligand binding (BAO_0002776)     0.98
Binding (BAO_0002989)                 0.98
Protein activity (BAO_0013016)        0.96
in vivo method (BAO_0040021)          0.96
Cell phenotype (BAO_0002542)          0.98
Nucleic acid binding                  1.00
Antimicrobial activity                0.99

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk2


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.51           0.29       53.61    0.54


  1     200         85.82          58.67       60.28    0.60


  2     400         46.33          28.45       80.04    0.80


  5     600         32.70          20.45       90.05    0.90


  7     800         34.67          15.14       92.82    0.93


 10    1000         35.31          11.70       94.53    0.95


 14    1200         36.14           8.70       95.68    0.96


 19    1400         35.53           7.06       96.44    0.96


 25    1600         37.11           5.57       96.56    0.97


 32    1800         38.33           4.44       96.40    0.96


 41    2000         39.35           3.37       96.55    0.97


 51    2200         39.37           2.61       96.79    0.97


 64    2400         41.98           1.96       96.85    0.97


 78    2600         33.59           1.45       97.11    0.97


 91    2800         28.24           1.02       97.63    0.98


104    3000         22.05           0.74       97.48    0.97


118    3200         17.50           0.58       97.96    0.98


131    3400         12.54           0.45       98.19    0.98


144    3600         11.76           0.34       98.18    0.98


158    3800          8.94           0.27       98.09    0.98


171    4000          8.54           0.22       98.19    0.98


184    4200          6.95           0.22       98.20    0.98


198    4400          5.23           0.16       97.83    0.98


211    4600          3.80           0.16       97.92    0.98


224    4800          5.57           0.15       98.03    0.98


238    5000          3.62           0.10       98.08    0.98


251    5200          5.81           0.07       98.15    0.98


264    5400          2.89           0.06       98.16    0.98


278    5600          2.96           0.04       98.20    0.98


291    5800          3.54           0.05       98.23    0.98


304    6000          4.02           0.03       98.04    0.98


318    6200          2.31           0.03       98.15    0.98


331    6400          1.27           0.03       98.12    0.98


344    6600          1.32           0.02       98.06    0.98


358    6800          2.82           0.02       98.08    0.98


371    7000          1.05           0.01       98.19    0.98


384    7200          1.16           0.02       98.23    0.98


398    7400          1.43           0.01       98.13    0.98


411    7600          1.27           0.02       98.08    0.98


424    7800          1.86           0.01       98.23    0.98


438    8000          1.59           0.02       98.22    0.98


451    8200          2.39           0.02       98.25    0.98


464    8400          2.37           0.02       98.26    0.98


478    8600          2.14           0.01       98.24    0.98


491    8800          0.46           0.01       98.13    0.98


504    9000          0.40           0.01       98.12    0.98


518    9200          1.37           0.01       98.15    0.98


531    9400          0.15           0.00       98.09    0.98


544    9600          0.15           0.01       98.11    0.98


558    9800          0.14           0.00       98.16    0.98


571   10000          1.05           0.01       98.12    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.26
SPEED                 28200

                                         P        R        F
Radioligand binding (BAO_0002776)    90.48    86.36    88.37
Binding (BAO_0002989)                95.45    75.00    84.00
Protein activity (BAO_0013016)       83.33    81.40    82.35
in vivo method (BAO_0040021)        100.00    92.86    96.30
Cell phenotype (BAO_0002542)         95.24    86.96    90.91
Nucleic acid binding                100.00    96.30    98.11
Antimicrobial activity               87.50    87.50    87.50

                                   ROC AUC
Radioligand binding (BAO_0002776)     1.00
Binding (BAO_0002989)                 0.98
Protein activity (BAO_0013016)        0.96
in vivo method (BAO_0040021)          1.00
Cell phenotype (BAO_0002542)          0.98
Nucleic acid binding                  1.00
Antimicrobial activity                0.97

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk3


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.39           0.46       52.41    0.52


  1     200         90.73          59.23       58.93    0.59


  2     400         46.00          28.84       73.60    0.74


  5     600         32.09          19.79       83.98    0.84


  7     800         31.94          14.89       88.55    0.89


 10    1000         35.46          11.84       91.51    0.92


 14    1200         34.02           9.34       93.62    0.94


 19    1400         34.64           7.21       94.72    0.95


 25    1600         43.42           6.05       95.45    0.95


 32    1800         41.37           4.63       96.05    0.96


 41    2000         41.71           3.52       96.81    0.97


 51    2200         42.83           2.55       97.33    0.97


 64    2400         41.97           1.90       97.32    0.97


 78    2600         33.76           1.37       97.46    0.97


 91    2800         26.05           0.94       97.45    0.97


104    3000         21.73           0.74       97.58    0.98


118    3200         17.83           0.57       97.53    0.98


131    3400         12.18           0.40       97.48    0.97


144    3600         10.12           0.32       97.55    0.98


158    3800          8.54           0.26       97.60    0.98


171    4000         10.08           0.27       97.62    0.98


184    4200          5.49           0.20       97.63    0.98


198    4400          4.41           0.17       97.60    0.98


211    4600          4.72           0.16       97.64    0.98


224    4800          4.88           0.14       97.66    0.98


238    5000          4.25           0.13       97.65    0.98


251    5200          2.50           0.09       97.68    0.98


264    5400          2.67           0.09       97.60    0.98


278    5600          3.01           0.06       97.59    0.98


291    5800          2.95           0.05       97.58    0.98


304    6000          2.02           0.04       97.44    0.97


318    6200          2.48           0.03       97.61    0.98


331    6400          1.59           0.02       97.73    0.98


344    6600          0.94           0.03       97.61    0.98


358    6800          1.01           0.02       97.67    0.98


371    7000          1.50           0.02       97.67    0.98


384    7200          1.00           0.01       97.73    0.98


398    7400          0.63           0.01       97.72    0.98


411    7600          1.46           0.02       97.70    0.98


424    7800          0.25           0.01       97.71    0.98


438    8000          1.31           0.02       97.81    0.98


451    8200          0.98           0.01       97.79    0.98


464    8400          0.99           0.01       97.79    0.98


478    8600          2.81           0.02       97.81    0.98


491    8800          1.32           0.01       97.75    0.98


504    9000          0.44           0.01       97.75    0.98


518    9200          0.64           0.01       97.73    0.98


531    9400          0.13           0.00       97.78    0.98


544    9600          0.38           0.01       97.82    0.98


558    9800          0.76           0.01       97.83    0.98


571   10000          0.46           0.01       97.79    0.98


584   10200          0.85           0.01       97.82    0.98


598   10400          0.84           0.01       97.86    0.98


611   10600          0.06           0.01       97.89    0.98


624   10800          1.15           0.01       97.87    0.98


638   11000          0.24           0.00       97.88    0.98


651   11200          0.84           0.00       97.94    0.98


664   11400          0.35           0.01       97.94    0.98


678   11600          0.30           0.00       97.95    0.98


691   11800          0.34           0.00       97.91    0.98


704   12000          0.30           0.00       97.86    0.98


718   12200          0.20           0.00       97.85    0.98


731   12400          0.10           0.00       97.89    0.98


744   12600          0.79           0.01       97.89    0.98


758   12800          0.80           0.01       97.87    0.98


771   13000          0.26           0.00       97.87    0.98


784   13200          0.34           0.00       97.96    0.98


798   13400          0.11           0.00       97.97    0.98


811   13600          0.63           0.00       97.92    0.98


824   13800          0.05           0.00       97.89    0.98


838   14000          1.43           0.01       97.89    0.98


851   14200          0.29           0.01       97.89    0.98


864   14400          0.44           0.01       97.86    0.98


878   14600          0.08           0.00       97.87    0.98


891   14800          0.64           0.00       97.88    0.98


904   15000          0.02           0.00       97.87    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.97
SPEED                 28144

                                         P        R        F
Radioligand binding (BAO_0002776)    95.65    95.65    95.65
Binding (BAO_0002989)                80.77    75.00    77.78
Protein activity (BAO_0013016)       85.29    70.73    77.33
in vivo method (BAO_0040021)        100.00   100.00   100.00
Cell phenotype (BAO_0002542)         61.54    80.00    69.57
Nucleic acid binding                 92.50    90.24    91.36
Antimicrobial activity              100.00    77.78    87.50

                                   ROC AUC
Radioligand binding (BAO_0002776)     0.97
Binding (BAO_0002989)                 0.97
Protein activity (BAO_0013016)        0.95
in vivo method (BAO_0040021)          1.00
Cell phenotype (BAO_0002542)          0.99
Nucleic acid binding                  1.00
Antimicrobial activity                0.98

✔ Saved results to training/metrics.json
ℹ Runn

chunk4


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.54           0.39       52.54    0.53


  1     200         89.76          59.35       58.54    0.59


  2     400         45.97          30.13       76.79    0.77


  5     600         33.21          19.81       90.52    0.91


  7     800         33.34          15.63       94.20    0.94


 10    1000         35.07          11.20       96.17    0.96


 14    1200         34.77           9.04       96.97    0.97


 19    1400         39.24           7.13       97.49    0.97


 25    1600         39.67           5.69       97.81    0.98


 32    1800         43.27           4.70       97.82    0.98


 41    2000         44.80           3.44       97.83    0.98


 51    2200         43.56           2.66       98.07    0.98


 64    2400         41.57           1.89       98.07    0.98


 78    2600         39.49           1.45       98.12    0.98


 91    2800         28.58           0.99       98.13    0.98


104    3000         18.51           0.76       98.07    0.98


118    3200         17.75           0.59       98.14    0.98


131    3400         15.73           0.49       98.17    0.98


144    3600         11.29           0.35       98.16    0.98


158    3800          8.72           0.24       98.08    0.98


171    4000          8.42           0.21       98.13    0.98


184    4200          9.07           0.17       98.11    0.98


198    4400          6.65           0.13       98.06    0.98


211    4600          6.34           0.09       98.10    0.98


224    4800          7.54           0.11       98.12    0.98


238    5000          2.79           0.06       98.07    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.17
SPEED                 26948

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    89.29    94.34
Binding (BAO_0002989)                92.00    95.83    93.88
Protein activity (BAO_0013016)       80.77    75.00    77.78
in vivo method (BAO_0040021)         92.86    92.86    92.86
Cell phenotype (BAO_0002542)         93.75    68.18    78.95
Nucleic acid binding                100.00    94.29    97.06
Antimicrobial activity              100.00    76.19    86.49

                                 ROC AUC
Radioligand binding (BAO_0002776)   0.97
Binding (BAO_0002989)               1.00
Protein activity (BAO_0013016)      0.97
in vivo method (BAO_0040021)        1.00
Cell phenotype (BAO_0002542)        0.99
Nucleic acid binding                1.00
Antimicrobial activity              0.95

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk5

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.74           0.34       54.37    0.54


  1     200         89.67          60.25       59.54    0.60


  2     400         44.77          29.16       80.84    0.81


  5     600         31.31          20.20       89.86    0.90


  7     800         32.30          15.05       93.27    0.93


 10    1000         35.55          12.13       94.72    0.95


 14    1200         37.97           9.40       95.58    0.96


 19    1400         37.76           7.60       96.44    0.96


 25    1600         41.28           6.14       96.80    0.97


 32    1800         45.38           4.88       96.92    0.97


 41    2000         44.03           3.57       97.12    0.97


 52    2200         42.80           2.57       97.22    0.97


 65    2400         38.45           1.87       97.43    0.97


 78    2600         32.35           1.29       97.45    0.97


 91    2800         25.14           0.95       97.57    0.98


105    3000         18.47           0.64       97.67    0.98


118    3200         15.34           0.53       97.92    0.98


131    3400         12.80           0.40       97.73    0.98


145    3600         10.29           0.33       97.98    0.98


158    3800         10.01           0.25       97.83    0.98


171    4000          5.28           0.18       97.85    0.98


185    4200          6.71           0.15       97.78    0.98


198    4400          4.28           0.12       97.65    0.98


211    4600          4.96           0.11       97.70    0.98


225    4800          2.46           0.10       97.71    0.98


238    5000          4.86           0.11       97.66    0.98


251    5200          2.90           0.09       97.98    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.98
SPEED                 28883

                                         P        R        F
Radioligand binding (BAO_0002776)    90.48    90.48    90.48
Binding (BAO_0002989)                84.00    84.00    84.00
Protein activity (BAO_0013016)       84.62    84.62    84.62
in vivo method (BAO_0040021)         92.31    92.31    92.31
Cell phenotype (BAO_0002542)         94.12    76.19    84.21
Nucleic acid binding                 94.12    96.97    95.52
Antimicrobial activity               94.12    76.19    84.21

                                 ROC AUC
Radioligand binding (BAO_0002776)   0.99
Binding (BAO_0002989)               0.98
Protein activity (BAO_0013016)      0.97
in vivo method (BAO_0040021)        0.98
Cell phenotype (BAO_0002542)        0.99
Nucleic acid binding                1.00
Antimicrobial activity              0.94

✔ Saved results to training/metrics.json
ℹ Running workflow 'all'


chunk6

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.40           0.43       54.52    0.55


  1     200         91.24          60.73       60.96    0.61


  2     400         44.39          28.76       80.46    0.80


  5     600         30.32          19.29       88.49    0.88


  7     800         34.32          15.01       92.84    0.93


 10    1000         34.10          12.05       94.72    0.95


 14    1200         37.65           9.46       96.07    0.96


 19    1400         37.69           7.89       96.97    0.97


 25    1600         39.50           6.19       97.49    0.97


 32    1800         44.01           4.95       97.90    0.98


 41    2000         45.14           3.94       98.01    0.98


 51    2200         46.51           2.94       98.10    0.98


 64    2400         42.09           2.24       98.27    0.98


 78    2600         34.18           1.59       98.25    0.98


 91    2800         28.47           1.20       98.71    0.99


104    3000         22.37           0.89       98.63    0.99


118    3200         17.51           0.64       98.72    0.99


131    3400         14.50           0.49       98.85    0.99


144    3600          9.37           0.35       98.88    0.99


158    3800         10.29           0.33       98.90    0.99


171    4000          9.32           0.28       98.95    0.99


184    4200          6.86           0.20       98.93    0.99


198    4400          5.34           0.19       98.85    0.99


211    4600          4.47           0.17       98.91    0.99


224    4800          4.77           0.15       98.90    0.99


238    5000          5.12           0.16       98.91    0.99


251    5200          2.34           0.12       98.94    0.99


264    5400          3.49           0.08       98.88    0.99


278    5600          5.66           0.06       98.90    0.99
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.95
SPEED                 28946

                                         P        R        F
Radioligand binding (BAO_0002776)    96.00    82.76    88.89
Binding (BAO_0002989)                96.30    96.30    96.30
Protein activity (BAO_0013016)       91.18    91.18    91.18
in vivo method (BAO_0040021)        100.00   100.00   100.00
Cell phenotype (BAO_0002542)         85.19    92.00    88.46
Nucleic acid binding                 96.97    94.12    95.52
Antimicrobial activity              100.00    86.67    92.86

                                 ROC AUC
Radioligand binding (BAO_0002776)   0.96
Binding (BAO_0002989)               1.00
Protein activity (BAO_0013016)      0.98
in vivo method (BAO_0040021)        1.00
Cell phenotype (BAO_0002542)        1.00
Nucleic acid binding                1.00
Antimicrobial activity              0.99

✔ Saved results to training/metrics.json
ℹ Runn

chunk7

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.37           0.38       51.23    0.51


  1     200         91.52          59.38       59.47    0.59


  3     400         47.34          28.93       81.24    0.81


  5     600         31.79          19.54       87.67    0.88


  7     800         33.48          14.09       89.59    0.90


 10    1000         34.59          11.43       91.37    0.91


 14    1200         35.44           8.70       92.52    0.93


 19    1400         37.93           6.96       93.15    0.93


 25    1600         36.73           5.36       93.95    0.94


 32    1800         41.19           4.33       94.86    0.95


 41    2000         40.73           3.30       95.16    0.95


 52    2200         41.32           2.36       96.58    0.97


 65    2400         42.65           1.80       96.49    0.96


 78    2600         35.53           1.38       96.56    0.97


 92    2800         26.08           0.94       96.66    0.97


105    3000         20.88           0.70       96.61    0.97


118    3200         19.51           0.55       96.90    0.97


132    3400         10.34           0.40       97.01    0.97


145    3600         11.42           0.34       96.99    0.97


158    3800         10.14           0.31       97.02    0.97


172    4000          6.69           0.23       97.26    0.97


185    4200          8.29           0.19       96.91    0.97


198    4400          7.78           0.17       96.89    0.97


212    4600          4.03           0.11       96.72    0.97


225    4800          4.07           0.09       97.10    0.97


238    5000          4.16           0.08       96.70    0.97


252    5200          5.13           0.07       96.76    0.97


265    5400          3.41           0.05       96.73    0.97


278    5600          3.16           0.05       96.95    0.97
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.26
SPEED                 28850

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    86.96    93.02
Binding (BAO_0002989)                83.33    86.21    84.75
Protein activity (BAO_0013016)       86.11    79.49    82.67
in vivo method (BAO_0040021)        100.00    77.78    87.50
Cell phenotype (BAO_0002542)         73.33    73.33    73.33
Nucleic acid binding                 97.30    92.31    94.74
Antimicrobial activity               93.75    78.95    85.71

                                 ROC AUC
Radioligand binding (BAO_0002776)   0.99
Binding (BAO_0002989)               0.96
Protein activity (BAO_0013016)      0.97
in vivo method (BAO_0040021)        0.97
Cell phenotype (BAO_0002542)        0.93
Nucleic acid binding                1.00
Antimicrobial activity              0.98

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk8

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.29           0.37       53.67    0.54


  1     200         85.12          58.13       59.91    0.60


  2     400         43.09          29.37       75.14    0.75


  4     600         31.78          19.81       86.01    0.86


  7     800         32.85          15.49       89.65    0.90


 10    1000         36.94          12.12       91.87    0.92


 14    1200         39.32           9.41       93.70    0.94


 19    1400         37.89           7.42       94.60    0.95


 24    1600         41.71           5.93       95.25    0.95


 32    1800         41.33           4.66       95.90    0.96


 40    2000         45.48           3.69       96.05    0.96


 51    2200         45.01           2.86       96.29    0.96


 64    2400         43.29           2.20       96.52    0.97


 77    2600         33.49           1.63       96.55    0.97


 90    2800         29.77           1.30       96.66    0.97


104    3000         20.84           1.06       96.74    0.97


117    3200         17.05           0.83       96.97    0.97


130    3400         14.74           0.67       97.15    0.97


144    3600         14.22           0.55       97.28    0.97


157    3800         15.04           0.41       97.42    0.97


170    4000          8.89           0.28       97.35    0.97


184    4200          8.48           0.26       97.30    0.97


197    4400          5.40           0.20       97.33    0.97


210    4600          6.17           0.18       97.43    0.97


224    4800          6.34           0.17       97.50    0.97


237    5000          3.86           0.15       97.43    0.97


250    5200          6.83           0.14       97.49    0.97


264    5400          3.01           0.13       97.49    0.97


277    5600          4.77           0.14       97.39    0.97


290    5800          0.81           0.10       97.36    0.97


304    6000          2.94           0.12       97.49    0.97


317    6200          3.65           0.12       97.51    0.98


330    6400          1.03           0.11       97.56    0.98


344    6600          1.98           0.11       97.59    0.98


357    6800          0.39           0.09       97.61    0.98


370    7000          2.11           0.11       97.55    0.98


384    7200          1.56           0.10       97.42    0.97


397    7400          3.20           0.11       97.42    0.97


410    7600          0.43           0.10       97.45    0.97


424    7800          1.17           0.10       97.40    0.97


437    8000          2.03           0.10       97.40    0.97


450    8200          1.89           0.10       97.45    0.97


464    8400          1.42           0.10       97.47    0.97
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.61
SPEED                 23014

                                         P        R        F
Radioligand binding (BAO_0002776)    95.45    95.45    95.45
Binding (BAO_0002989)                84.00    80.77    82.35
Protein activity (BAO_0013016)       75.00    63.64    68.85
in vivo method (BAO_0040021)        100.00    76.47    86.67
Cell phenotype (BAO_0002542)         90.48    76.00    82.61
Nucleic acid binding                 91.89    94.44    93.15
Antimicrobial activity               92.86   100.00    96.30

                                 ROC AUC
Radioligand binding (BAO_0002776)   1.00
Binding (BAO_0002989)               0.99
Protein activity (BAO_0013016)      0.93
in vivo method (BAO_0040021)        0.96
Cell phenotype (BAO_0002542)        0.96
Nucleic acid binding                1.00
Antimicrobial activity              1.00

✔ Saved results to training/metrics.json
ℹ Running work

chunk9

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.28           0.40       51.66    0.52


  1     200         88.53          59.44       59.98    0.60


  2     400         44.48          29.56       80.83    0.81


  5     600         34.04          20.42       90.64    0.91


  7     800         32.45          15.53       93.49    0.93


 10    1000         37.22          12.45       95.44    0.95


 14    1200         35.98           9.67       97.09    0.97


 19    1400         40.04           7.46       97.64    0.98


 25    1600         40.62           6.06       98.13    0.98


 32    1800         41.15           4.85       98.36    0.98


 41    2000         45.56           3.72       98.50    0.99


 52    2200         45.32           2.98       98.54    0.99


 65    2400         41.64           2.19       98.58    0.99


 78    2600         38.78           1.64       98.66    0.99


 91    2800         32.95           1.19       98.68    0.99


105    3000         26.45           0.95       98.59    0.99


118    3200         19.09           0.69       98.74    0.99


131    3400         15.33           0.54       98.82    0.99


145    3600         18.21           0.47       98.74    0.99


158    3800         10.34           0.35       98.77    0.99


171    4000          8.89           0.28       98.81    0.99


185    4200          9.73           0.26       98.76    0.99


198    4400          6.90           0.23       98.79    0.99


211    4600          4.85           0.17       98.75    0.99


225    4800          6.06           0.19       98.86    0.99


238    5000          5.12           0.11       98.83    0.99


251    5200          6.45           0.14       98.90    0.99


265    5400          5.97           0.11       98.85    0.99


278    5600          3.56           0.12       98.87    0.99


291    5800          2.53           0.07       98.89    0.99


305    6000          2.42           0.08       98.90    0.99


318    6200          1.64           0.08       98.93    0.99


331    6400          2.93           0.07       98.87    0.99


345    6600          3.67           0.06       98.87    0.99


358    6800          4.79           0.06       98.89    0.99


371    7000          2.11           0.02       98.89    0.99


385    7200          3.92           0.04       98.85    0.99


398    7400          0.81           0.01       98.90    0.99


411    7600          3.12           0.02       98.86    0.99


425    7800          1.88           0.03       98.84    0.99
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.93
SPEED                 28910

                                         P        R        F
Radioligand binding (BAO_0002776)    95.83   100.00    97.87
Binding (BAO_0002989)               100.00    85.19    92.00
Protein activity (BAO_0013016)       76.92    76.92    76.92
in vivo method (BAO_0040021)        100.00    93.33    96.55
Cell phenotype (BAO_0002542)         84.62    81.48    83.02
Nucleic acid binding                 97.06    97.06    97.06
Antimicrobial activity               94.44    85.00    89.47

                                 ROC AUC
Radioligand binding (BAO_0002776)   1.00
Binding (BAO_0002989)               0.98
Protein activity (BAO_0013016)      0.97
in vivo method (BAO_0040021)        1.00
Cell phenotype (BAO_0002542)        0.98
Nucleic acid binding                1.00
Antimicrobial activity              0.99

✔ Saved results to training/metrics.json
ℹ Running work

chunk10

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.41           0.46       51.85    0.52


  1     200         90.97          59.59       58.08    0.58


  2     400         48.04          29.72       79.81    0.80


  5     600         29.83          20.04       89.27    0.89


  7     800         31.96          15.42       93.57    0.94


 10    1000         32.44          11.74       95.28    0.95


 14    1200         35.75           8.99       96.50    0.97


 19    1400         36.24           7.12       97.10    0.97


 25    1600         40.75           5.83       97.61    0.98


 32    1800         43.85           4.86       97.88    0.98


 41    2000         43.31           3.66       97.84    0.98


 51    2200         45.36           2.73       97.85    0.98


 64    2400         43.34           2.10       97.97    0.98


 77    2600         38.00           1.59       97.95    0.98


 91    2800         29.93           1.15       98.11    0.98


104    3000         26.37           0.92       98.25    0.98


117    3200         18.35           0.62       98.43    0.98


131    3400         14.78           0.55       98.44    0.98


144    3600         12.74           0.42       98.37    0.98


157    3800         10.62           0.33       98.29    0.98


171    4000         10.85           0.28       98.39    0.98


184    4200          7.88           0.21       98.33    0.98


197    4400          7.99           0.20       98.40    0.98


211    4600          7.52           0.17       98.38    0.98


224    4800          5.64           0.10       98.40    0.98


237    5000          6.35           0.10       98.31    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.44
SPEED                 28002

                                         P        R        F
Radioligand binding (BAO_0002776)    92.59   100.00    96.15
Binding (BAO_0002989)                84.62    81.48    83.02
Protein activity (BAO_0013016)       86.11    79.49    82.67
in vivo method (BAO_0040021)        100.00   100.00   100.00
Cell phenotype (BAO_0002542)         83.33    88.24    85.71
Nucleic acid binding                 96.77    93.75    95.24
Antimicrobial activity              100.00    76.19    86.49

                                 ROC AUC
Radioligand binding (BAO_0002776)   1.00
Binding (BAO_0002989)               0.98
Protein activity (BAO_0013016)      0.98
in vivo method (BAO_0040021)        1.00
Cell phenotype (BAO_0002542)        0.98
Nucleic acid binding                0.99
Antimicrobial activity              0.96

✔ Saved results to training/metrics.json
ℹ Runn

chunk11

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.40           0.41       53.54    0.54


  1     200         91.72          59.34       59.73    0.60


  3     400         46.63          29.23       79.64    0.80


  5     600         30.34          18.90       88.19    0.88


  7     800         32.19          13.97       91.48    0.91


 10    1000         33.79          11.16       92.58    0.93


 14    1200         33.45           8.41       93.75    0.94


 19    1400         34.97           6.68       94.46    0.94


 25    1600         35.00           5.13       95.09    0.95


 32    1800         39.17           4.15       95.09    0.95


 41    2000         41.26           3.27       95.56    0.96


 52    2200         40.11           2.24       96.34    0.96


 65    2400         43.91           1.62       96.44    0.96


 78    2600         30.83           1.16       96.44    0.96


 91    2800         26.71           0.82       96.79    0.97


105    3000         18.72           0.58       96.81    0.97


118    3200         15.80           0.41       96.84    0.97


131    3400         13.98           0.38       96.76    0.97


145    3600         12.03           0.29       96.97    0.97


158    3800         13.24           0.22       96.86    0.97


171    4000          8.78           0.14       96.82    0.97


185    4200          4.55           0.10       96.66    0.97


198    4400          5.42           0.10       96.64    0.97


211    4600          4.01           0.08       96.72    0.97


225    4800          7.27           0.07       96.50    0.96


238    5000          4.26           0.06       96.79    0.97


251    5200          2.74           0.05       96.72    0.97
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   96.97
SPEED                 28836

                                         P        R        F
Radioligand binding (BAO_0002776)    94.12    80.00    86.49
Binding (BAO_0002989)                86.21    96.15    90.91
Protein activity (BAO_0013016)       82.76    66.67    73.85
in vivo method (BAO_0040021)         90.91    76.92    83.33
Cell phenotype (BAO_0002542)         80.00    86.96    83.33
Nucleic acid binding                 94.29    91.67    92.96
Antimicrobial activity               94.12    84.21    88.89

                                 ROC AUC
Radioligand binding (BAO_0002776)   0.96
Binding (BAO_0002989)               0.98
Protein activity (BAO_0013016)      0.96
in vivo method (BAO_0040021)        0.92
Cell phenotype (BAO_0002542)        0.98
Nucleic acid binding                1.00
Antimicrobial activity              0.98

✔ Saved results to training/metrics.json
ℹ Running workflow 'all'


chunk12

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.18           0.45       56.06    0.56


  1     200         95.37          59.33       62.64    0.63


  3     400         47.52          29.06       80.19    0.80


  5     600         29.86          19.47       87.48    0.87


  7     800         33.30          14.90       91.68    0.92


 10    1000         33.39          11.43       93.69    0.94


 14    1200         35.62           9.13       94.75    0.95


 19    1400         39.04           7.13       94.96    0.95


 25    1600         41.35           5.63       95.79    0.96


 32    1800         43.01           4.42       96.28    0.96


 41    2000         42.99           3.52       96.34    0.96


 52    2200         43.79           2.64       96.65    0.97


 65    2400         41.21           1.91       97.06    0.97


 78    2600         34.19           1.48       97.54    0.98


 92    2800         26.00           1.07       97.69    0.98


105    3000         22.41           0.85       97.79    0.98


118    3200         17.90           0.63       98.05    0.98


132    3400         14.92           0.50       97.95    0.98


145    3600         11.67           0.41       98.01    0.98


158    3800          9.76           0.27       97.97    0.98


172    4000          8.61           0.24       98.06    0.98


185    4200          8.49           0.16       98.01    0.98


198    4400          5.96           0.14       98.10    0.98


212    4600          7.56           0.10       98.08    0.98


225    4800          3.86           0.07       97.93    0.98


238    5000          5.82           0.08       98.08    0.98


252    5200          2.62           0.06       98.12    0.98


265    5400          2.17           0.04       97.95    0.98


278    5600          4.22           0.06       97.95    0.98


292    5800          2.26           0.04       97.99    0.98


305    6000          1.77           0.03       97.97    0.98


318    6200          1.29           0.02       98.01    0.98


332    6400          1.52           0.03       97.96    0.98


345    6600          2.74           0.03       97.97    0.98


358    6800          3.18           0.04       97.96    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.12
SPEED                 28865

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    81.25    89.66
Binding (BAO_0002989)                88.24    88.24    88.24
Protein activity (BAO_0013016)       93.10    84.38    88.52
in vivo method (BAO_0040021)        100.00    94.12    96.97
Cell phenotype (BAO_0002542)         86.36    90.48    88.37
Nucleic acid binding                 96.88    91.18    93.94
Antimicrobial activity              100.00    89.47    94.44

                                 ROC AUC
Radioligand binding (BAO_0002776)   0.98
Binding (BAO_0002989)               0.99
Protein activity (BAO_0013016)      0.94
in vivo method (BAO_0040021)        1.00
Cell phenotype (BAO_0002542)        0.96
Nucleic acid binding                1.00
Antimicrobial activity              1.00

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk13

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.55           0.34       53.55    0.54


  1     200         92.12          60.20       59.45    0.59


  2     400         44.67          28.49       83.33    0.83


  5     600         33.55          20.03       90.49    0.90


  7     800         34.86          15.82       93.37    0.93


 10    1000         36.23          11.89       95.65    0.96


 14    1200         38.65           9.83       96.89    0.97


 19    1400         38.95           7.51       97.45    0.97


 25    1600         40.98           5.96       97.82    0.98


 32    1800         42.25           4.86       98.28    0.98


 40    2000         44.91           3.63       98.38    0.98


 51    2200         41.25           2.54       98.46    0.98


 64    2400         43.87           1.97       98.55    0.99


 77    2600         33.20           1.34       98.52    0.99


 91    2800         27.21           1.02       98.61    0.99


104    3000         23.09           0.73       98.62    0.99


117    3200         17.55           0.55       98.70    0.99


131    3400         16.80           0.47       98.72    0.99


144    3600         15.42           0.38       98.72    0.99


157    3800         11.83           0.29       98.87    0.99


171    4000          8.99           0.24       98.77    0.99


184    4200          9.00           0.25       98.82    0.99


197    4400          9.96           0.22       98.76    0.99


211    4600          4.84           0.17       98.69    0.99


224    4800          6.25           0.15       98.69    0.99


237    5000          4.91           0.16       98.68    0.99


251    5200          3.23           0.14       98.75    0.99


264    5400          5.41           0.15       98.72    0.99
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.87
SPEED                 28200

                                         P        R        F
Radioligand binding (BAO_0002776)    90.91   100.00    95.24
Binding (BAO_0002989)               100.00    72.73    84.21
Protein activity (BAO_0013016)       81.48    81.48    81.48
in vivo method (BAO_0040021)         83.33   100.00    90.91
Cell phenotype (BAO_0002542)        100.00    76.92    86.96
Nucleic acid binding                100.00    97.50    98.73
Antimicrobial activity               85.71    70.59    77.42

                                 ROC AUC
Radioligand binding (BAO_0002776)   1.00
Binding (BAO_0002989)               0.99
Protein activity (BAO_0013016)      0.97
in vivo method (BAO_0040021)        1.00
Cell phenotype (BAO_0002542)        0.97
Nucleic acid binding                1.00
Antimicrobial activity              0.99

✔ Saved results to training/metrics.json
ℹ Running work

chunk14


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.56           0.32       50.19    0.50


  1     200         89.44          59.69       56.52    0.57


  3     400         43.96          28.86       80.92    0.81


  5     600         33.82          19.18       89.18    0.89


  7     800         33.54          14.71       92.44    0.92


 10    1000         34.98          11.64       94.08    0.94


 14    1200         34.34           9.24       95.16    0.95


 19    1400         33.62           7.26       95.31    0.95


 25    1600         38.95           5.91       95.92    0.96


 32    1800         37.80           4.42       96.22    0.96


 41    2000         38.29           3.32       96.78    0.97


 52    2200         42.46           2.71       97.27    0.97


 65    2400         39.38           1.99       97.38    0.97


 78    2600         31.98           1.43       97.64    0.98


 91    2800         26.04           1.13       97.91    0.98


105    3000         23.77           0.89       98.22    0.98


118    3200         16.35           0.65       98.40    0.98


131    3400         14.26           0.49       98.37    0.98


145    3600         14.31           0.42       97.90    0.98


158    3800          8.09           0.30       98.16    0.98


171    4000          7.40           0.25       98.04    0.98


185    4200          6.51           0.20       97.95    0.98


198    4400          6.59           0.19       98.38    0.98


211    4600          5.35           0.17       98.26    0.98


225    4800          6.57           0.15       98.30    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.40
SPEED                 28844

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00   100.00   100.00
Binding (BAO_0002989)               100.00    80.00    88.89
Protein activity (BAO_0013016)       91.67    89.19    90.41
in vivo method (BAO_0040021)         88.89    72.73    80.00
Cell phenotype (BAO_0002542)         95.45    80.77    87.50
Nucleic acid binding                 91.43    94.12    92.75
Antimicrobial activity               90.91    83.33    86.96

                                    ROC AUC
Radioligand binding (BAO_0002776)      1.00
Binding (BAO_0002989)                  0.97
Protein activity (BAO_0013016)         0.99
in vivo method (BAO_0040021)           0.97
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   1.00
Antimicrobial activity                 0.98

✔ Saved results to training/metrics.json
ℹ Runn

chunk15


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.65           0.44       51.17    0.51


  1     200         87.11          59.61       58.38    0.58


  2     400         45.38          29.12       80.11    0.80


  4     600         30.36          19.61       88.19    0.88


  7     800         30.14          15.27       91.83    0.92


 10    1000         30.37          11.74       93.01    0.93


 14    1200         34.39           9.54       94.39    0.94


 18    1400         37.94           7.59       95.16    0.95


 24    1600         39.87           6.21       95.55    0.96


 31    1800         42.51           4.92       95.97    0.96


 40    2000         41.16           3.90       96.49    0.96


 50    2200         44.32           2.99       96.81    0.97


 63    2400         41.72           2.13       97.12    0.97


 75    2600         30.66           1.57       96.94    0.97


 88    2800         28.44           1.26       96.97    0.97


100    3000         18.82           0.89       97.17    0.97


113    3200         17.19           0.73       96.99    0.97


125    3400         14.91           0.60       97.30    0.97


138    3600         12.99           0.50       96.81    0.97


150    3800         10.43           0.37       96.79    0.97


163    4000          6.94           0.27       97.03    0.97


175    4200          6.65           0.25       97.31    0.97


188    4400          5.03           0.22       97.36    0.97


200    4600          3.69           0.19       97.32    0.97


213    4800          5.78           0.19       97.27    0.97


225    5000          4.04           0.17       97.29    0.97


238    5200          2.35           0.14       97.47    0.97


250    5400          2.79           0.15       97.53    0.98


263    5600          3.37           0.13       97.68    0.98


275    5800          2.85           0.13       97.76    0.98


288    6000          2.03           0.08       97.78    0.98


300    6200          2.25           0.08       97.58    0.98


313    6400          0.95           0.07       97.61    0.98


325    6600          2.48           0.07       97.48    0.97


338    6800          1.51           0.05       97.41    0.97


350    7000          1.78           0.04       97.53    0.98


363    7200          0.91           0.02       97.60    0.98


375    7400          1.26           0.02       97.58    0.98


388    7600          2.39           0.03       97.60    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.78
SPEED                 25646

                                         P        R        F
Radioligand binding (BAO_0002776)    90.00    90.00    90.00
Binding (BAO_0002989)                90.32    87.50    88.89
Protein activity (BAO_0013016)       89.29    78.12    83.33
in vivo method (BAO_0040021)        100.00    77.78    87.50
Cell phenotype (BAO_0002542)         81.82    94.74    87.80
Nucleic acid binding                 97.06    89.19    92.96
Antimicrobial activity              100.00    85.71    92.31

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.98
Binding (BAO_0002989)                  0.98
Protein activity (BAO_0013016)         0.97
in vivo method (BAO_0040021)           0.94
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   0.98
Antimicrobial activity                 1.00

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk16


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.43           0.27       52.27    0.52


  1     200         91.89          59.13       60.41    0.60


  3     400         46.88          28.81       80.89    0.81


  5     600         31.77          19.41       89.05    0.89


  7     800         32.72          14.98       92.13    0.92


 10    1000         36.70          11.60       94.53    0.95


 14    1200         38.88           9.22       95.23    0.95


 19    1400         39.22           7.54       95.99    0.96


 25    1600         38.60           5.74       96.55    0.97


 33    1800         43.41           4.55       97.00    0.97


 42    2000         44.70           3.41       97.06    0.97


 52    2200         47.26           2.60       97.50    0.98


 65    2400         44.17           1.90       97.51    0.98


 79    2600         34.60           1.31       97.52    0.98


 92    2800         27.06           0.97       97.72    0.98


105    3000         23.42           0.76       97.64    0.98


119    3200         16.49           0.58       97.54    0.98


132    3400         14.17           0.44       97.81    0.98


145    3600         14.26           0.40       97.64    0.98


159    3800          8.22           0.28       97.64    0.98


172    4000          7.67           0.24       97.67    0.98


185    4200          5.65           0.21       97.66    0.98


199    4400          7.01           0.18       97.58    0.98


212    4600          5.43           0.18       97.63    0.98


225    4800          5.69           0.17       97.66    0.98


239    5000          4.83           0.13       97.65    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.81
SPEED                 29003

                                         P        R        F
Radioligand binding (BAO_0002776)    86.96    90.91    88.89
Binding (BAO_0002989)                89.47    65.38    75.56
Protein activity (BAO_0013016)       86.49    82.05    84.21
in vivo method (BAO_0040021)        100.00   100.00   100.00
Cell phenotype (BAO_0002542)         84.00    80.77    82.35
Nucleic acid binding                 91.43   100.00    95.52
Antimicrobial activity               93.75    93.75    93.75

                                    ROC AUC
Radioligand binding (BAO_0002776)      1.00
Binding (BAO_0002989)                  0.94
Protein activity (BAO_0013016)         0.97
in vivo method (BAO_0040021)           1.00
Cell phenotype (BAO_0002542)           0.94
Nucleic acid binding                   1.00
Antimicrobial activity                 1.00

✔ Saved results to training/metrics.json
ℹ Runn

chunk17


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.54           0.43       50.66    0.51


  1     200         90.03          57.98       57.03    0.57


  2     400         45.35          27.82       80.87    0.81


  5     600         31.45          18.80       87.88    0.88


  7     800         31.20          14.14       90.88    0.91


 10    1000         32.15          11.17       93.21    0.93


 14    1200         32.77           8.44       94.55    0.95


 19    1400         34.31           7.01       95.62    0.96


 25    1600         36.58           5.43       96.24    0.96


 32    1800         39.55           4.24       96.69    0.97


 41    2000         43.22           3.42       96.90    0.97


 52    2200         40.05           2.50       97.28    0.97


 65    2400         39.21           1.79       97.52    0.98


 78    2600         29.05           1.25       97.49    0.97


 91    2800         24.24           0.91       97.68    0.98


105    3000         17.61           0.64       97.83    0.98


118    3200         14.52           0.50       97.59    0.98


131    3400         10.60           0.40       97.62    0.98


145    3600         11.22           0.34       97.72    0.98


158    3800          9.85           0.28       97.62    0.98


171    4000          5.77           0.21       97.65    0.98


185    4200          5.74           0.18       97.57    0.98


198    4400          5.55           0.15       97.44    0.97


211    4600          5.24           0.10       97.51    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.83
SPEED                 8249

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    87.50    93.33
Binding (BAO_0002989)                82.76    88.89    85.71
Protein activity (BAO_0013016)       78.79    83.87    81.25
in vivo method (BAO_0040021)        100.00    62.50    76.92
Cell phenotype (BAO_0002542)         92.00    95.83    93.88
Nucleic acid binding                100.00    93.75    96.77
Antimicrobial activity               87.50    73.68    80.00

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.99
Binding (BAO_0002989)                  0.99
Protein activity (BAO_0013016)         0.96
in vivo method (BAO_0040021)           0.98
Cell phenotype (BAO_0002542)           0.98
Nucleic acid binding                   1.00
Antimicrobial activity                 0.95

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk18


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.40           0.38       53.92    0.54


  1     200         87.44          60.12       62.08    0.62


  2     400         46.72          29.11       81.62    0.82


  5     600         33.25          20.05       92.67    0.93


  7     800         33.80          15.76       95.62    0.96


 10    1000         34.05          12.43       97.19    0.97


 14    1200         37.55          10.05       97.89    0.98


 19    1400         37.59           7.90       98.44    0.98


 25    1600         39.93           6.14       98.81    0.99


 32    1800         44.89           5.05       99.03    0.99


 41    2000         47.53           3.89       99.11    0.99


 52    2200         48.68           2.92       99.31    0.99


 65    2400         43.41           2.13       99.33    0.99


 78    2600         38.68           1.60       99.21    0.99


 91    2800         27.32           1.12       99.15    0.99


105    3000         24.32           0.87       99.22    0.99


118    3200         18.91           0.65       99.17    0.99


131    3400         13.66           0.47       99.01    0.99


145    3600         13.19           0.42       98.92    0.99


158    3800         13.81           0.33       99.03    0.99


171    4000          7.83           0.25       99.11    0.99
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   99.33
SPEED                 28492

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    95.00    97.44
Binding (BAO_0002989)                95.65    88.00    91.67
Protein activity (BAO_0013016)       90.32    80.00    84.85
in vivo method (BAO_0040021)        100.00   100.00   100.00
Cell phenotype (BAO_0002542)         78.95    78.95    78.95
Nucleic acid binding                 94.59    92.11    93.33
Antimicrobial activity              100.00    73.91    85.00

                                    ROC AUC
Radioligand binding (BAO_0002776)      1.00
Binding (BAO_0002989)                  1.00
Protein activity (BAO_0013016)         0.98
in vivo method (BAO_0040021)           1.00
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   1.00
Antimicrobial activity                 1.00

✔ Saved results to training/metrics.json
ℹ Runn

chunk19


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.51           0.42       56.99    0.57


  1     200         90.51          60.18       60.77    0.61


  3     400         43.56          29.53       76.52    0.77


  5     600         30.50          20.52       85.60    0.86


  7     800         30.36          14.44       89.29    0.89


 10    1000         32.41          11.76       91.25    0.91


 14    1200         38.15           9.30       93.26    0.93


 19    1400         35.98           7.20       94.35    0.94


 25    1600         35.66           5.40       94.95    0.95


 32    1800         41.46           4.45       95.93    0.96


 41    2000         39.32           3.29       96.31    0.96


 52    2200         39.72           2.49       96.91    0.97


 65    2400         37.20           1.99       97.27    0.97


 78    2600         35.19           1.55       97.30    0.97


 91    2800         29.72           1.12       97.43    0.97


105    3000         22.43           0.81       97.55    0.98


118    3200         19.37           0.58       97.86    0.98


131    3400         14.31           0.43       97.71    0.98


145    3600         11.27           0.32       97.74    0.98


158    3800         12.84           0.30       97.74    0.98


171    4000         10.45           0.26       97.76    0.98


185    4200          6.90           0.20       97.77    0.98


198    4400          6.40           0.20       97.84    0.98


211    4600          5.51           0.18       97.82    0.98


225    4800          6.98           0.16       97.71    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   97.86
SPEED                 27680

                                         P        R        F
Radioligand binding (BAO_0002776)    95.24    90.91    93.02
Binding (BAO_0002989)                91.30    87.50    89.36
Protein activity (BAO_0013016)       77.14    79.41    78.26
in vivo method (BAO_0040021)         92.31    85.71    88.89
Cell phenotype (BAO_0002542)        100.00    76.00    86.36
Nucleic acid binding                 94.44    91.89    93.15
Antimicrobial activity              100.00    87.50    93.33

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.97
Binding (BAO_0002989)                  0.98
Protein activity (BAO_0013016)         0.95
in vivo method (BAO_0040021)           1.00
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   1.00
Antimicrobial activity                 0.96

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk20


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.76           0.43       54.22    0.54


  1     200         90.90          59.94       61.89    0.62


  2     400         48.30          28.96       79.98    0.80


  5     600         32.04          19.42       86.75    0.87


  7     800         32.68          14.88       90.16    0.90


 10    1000         32.48          11.48       92.40    0.92


 14    1200         37.81           8.84       94.00    0.94


 19    1400         38.22           6.83       94.42    0.94


 25    1600         38.77           5.29       95.03    0.95


 32    1800         41.75           4.08       95.34    0.95


 41    2000         44.96           3.11       95.33    0.95


 52    2200         41.10           2.28       95.74    0.96


 65    2400         39.52           1.63       96.04    0.96


 78    2600         30.44           1.17       96.11    0.96


 91    2800         22.96           0.90       96.27    0.96


105    3000         21.97           0.63       96.41    0.96


118    3200         15.22           0.43       96.40    0.96


131    3400         13.20           0.34       96.44    0.96


145    3600         14.18           0.22       96.43    0.96


158    3800         10.86           0.18       96.38    0.96


171    4000          7.94           0.13       96.46    0.96


185    4200          6.91           0.12       96.53    0.97


198    4400          4.97           0.09       96.51    0.97


211    4600          4.46           0.07       96.41    0.96


225    4800          4.70           0.06       96.48    0.96


238    5000          4.66           0.06       96.42    0.96


251    5200          2.51           0.06       96.32    0.96


265    5400          3.60           0.04       96.30    0.96


278    5600          2.50           0.03       96.41    0.96


291    5800          2.62           0.04       96.40    0.96
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   96.53
SPEED                 27175

                                         P        R        F
Radioligand binding (BAO_0002776)    91.30    87.50    89.36
Binding (BAO_0002989)                84.21    72.73    78.05
Protein activity (BAO_0013016)       78.38    80.56    79.45
in vivo method (BAO_0040021)         81.82    90.00    85.71
Cell phenotype (BAO_0002542)         87.50    75.00    80.77
Nucleic acid binding                100.00    93.75    96.77
Antimicrobial activity              100.00    76.19    86.49

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.99
Binding (BAO_0002989)                  0.98
Protein activity (BAO_0013016)         0.96
in vivo method (BAO_0040021)           0.92
Cell phenotype (BAO_0002542)           0.93
Nucleic acid binding                   1.00
Antimicrobial activity                 0.97

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk21


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.37           0.37       54.91    0.55


  1     200         88.06          60.02       62.61    0.63


  2     400         45.16          29.34       81.91    0.82


  5     600         31.21          19.28       89.49    0.89


  7     800         31.00          14.69       92.53    0.93


 10    1000         33.10          11.82       93.45    0.93


 14    1200         34.78           9.29       94.14    0.94


 19    1400         37.09           7.45       95.00    0.95


 25    1600         41.44           6.13       96.00    0.96


 32    1800         42.03           4.84       96.40    0.96


 41    2000         42.83           3.71       96.70    0.97


 52    2200         41.16           2.75       97.14    0.97


 64    2400         43.57           2.09       97.59    0.98


 78    2600         35.73           1.52       98.28    0.98


 91    2800         25.52           1.10       98.33    0.98


104    3000         21.75           0.89       98.20    0.98


118    3200         19.42           0.66       97.94    0.98


131    3400         16.18           0.49       97.64    0.98


144    3600         11.20           0.38       97.96    0.98


158    3800          9.56           0.30       97.98    0.98


171    4000          6.86           0.23       97.87    0.98


184    4200          6.65           0.21       97.97    0.98


198    4400          3.97           0.16       98.13    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.33
SPEED                 28435

                                         P        R        F
Radioligand binding (BAO_0002776)    87.50    95.45    91.30
Binding (BAO_0002989)                90.62    82.86    86.57
Protein activity (BAO_0013016)       84.85    80.00    82.35
in vivo method (BAO_0040021)        100.00    90.91    95.24
Cell phenotype (BAO_0002542)        100.00    80.00    88.89
Nucleic acid binding                 91.89   100.00    95.77
Antimicrobial activity               93.33    87.50    90.32

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.99
Binding (BAO_0002989)                  0.98
Protein activity (BAO_0013016)         0.97
in vivo method (BAO_0040021)           1.00
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   1.00
Antimicrobial activity                 0.95

✔ Saved results to training/metrics.json
ℹ Running work

chunk22


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.48           0.37       53.12    0.53


  1     200         91.65          59.72       59.94    0.60


  3     400         47.34          29.45       81.53    0.82


  5     600         32.12          19.28       89.60    0.90


  7     800         32.92          15.16       93.61    0.94


 10    1000         36.31          11.96       95.80    0.96


 14    1200         36.17           9.32       96.91    0.97


 19    1400         39.26           7.44       97.39    0.97


 25    1600         39.12           5.77       97.80    0.98


 32    1800         43.30           4.67       97.99    0.98


 41    2000         43.42           3.50       98.18    0.98


 52    2200         44.20           2.69       98.42    0.98


 65    2400         46.37           2.06       98.67    0.99


 78    2600         35.08           1.52       98.84    0.99


 92    2800         30.25           1.14       98.92    0.99


105    3000         23.83           0.89       99.03    0.99


118    3200         18.19           0.67       99.02    0.99


132    3400         13.21           0.51       99.02    0.99


145    3600         13.63           0.48       98.99    0.99


158    3800         11.75           0.41       98.99    0.99


172    4000          9.37           0.33       99.04    0.99


185    4200          8.21           0.26       99.04    0.99


198    4400          7.90           0.24       99.13    0.99


212    4600          5.56           0.16       99.02    0.99


225    4800          7.05           0.15       99.07    0.99


238    5000          3.66           0.10       99.10    0.99


252    5200          4.67           0.10       99.02    0.99


265    5400          4.62           0.07       99.08    0.99


278    5600          3.51           0.06       99.13    0.99


292    5800          4.93           0.05       99.10    0.99


305    6000          4.15           0.04       99.13    0.99


318    6200          2.83           0.04       99.12    0.99


332    6400          1.65           0.03       99.16    0.99


345    6600          0.98           0.02       99.19    0.99


358    6800          2.07           0.04       99.20    0.99


372    7000          2.17           0.02       99.15    0.99


385    7200          1.19           0.02       99.26    0.99


398    7400          1.69           0.02       99.20    0.99


412    7600          1.81           0.02       99.18    0.99


425    7800          1.67           0.02       99.22    0.99


438    8000          1.43           0.02       99.19    0.99


452    8200          0.97           0.02       99.25    0.99


465    8400          1.13           0.02       99.21    0.99


478    8600          0.91           0.02       99.21    0.99


492    8800          0.63           0.01       99.19    0.99
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   99.26
SPEED                 27830

                                         P        R        F
Radioligand binding (BAO_0002776)    94.12    88.89    91.43
Binding (BAO_0002989)                96.15    83.33    89.29
Protein activity (BAO_0013016)       96.77    85.71    90.91
in vivo method (BAO_0040021)        100.00    85.71    92.31
Cell phenotype (BAO_0002542)         94.74    85.71    90.00
Nucleic acid binding                 92.68   100.00    96.20
Antimicrobial activity               93.33    82.35    87.50

                                    ROC AUC
Radioligand binding (BAO_0002776)      1.00
Binding (BAO_0002989)                  0.98
Protein activity (BAO_0013016)         0.99
in vivo method (BAO_0040021)           1.00
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   1.00
Antimicrobial activity                 0.99

✔ Saved results to training/metrics.json
ℹ Running work

chunk23


ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------


  0       0          0.21           0.40       52.59    0.53


  1     200         88.59          59.28       58.76    0.59


  3     400         48.87          28.74       81.31    0.81


  5     600         32.79          18.79       89.01    0.89


  7     800         31.88          14.48       92.14    0.92


 10    1000         33.06          11.29       93.92    0.94


 14    1200         33.53           9.07       94.85    0.95


 19    1400         35.81           6.90       95.69    0.96


 25    1600         38.70           5.71       96.14    0.96


 32    1800         35.66           4.27       96.84    0.97


 41    2000         39.17           3.52       97.18    0.97


 52    2200         40.25           2.69       97.48    0.97


 65    2400         41.78           2.11       97.58    0.98


 78    2600         34.45           1.61       97.93    0.98


 91    2800         27.53           1.14       98.13    0.98


105    3000         21.83           0.88       98.20    0.98


118    3200         19.29           0.65       98.23    0.98


131    3400         14.49           0.48       98.21    0.98


145    3600         12.69           0.41       98.32    0.98


158    3800         10.86           0.31       98.39    0.98


171    4000          8.47           0.20       98.47    0.98


185    4200          9.23           0.15       98.55    0.99


198    4400          6.10           0.13       98.54    0.99


211    4600          8.68           0.12       98.45    0.98


225    4800          4.71           0.09       98.45    0.98


238    5000          6.01           0.07       98.33    0.98


251    5200          5.38           0.07       98.14    0.98


265    5400          4.29           0.05       97.99    0.98


278    5600          2.12           0.04       98.16    0.98


291    5800          1.73           0.04       98.29    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.55
SPEED                 28142

                                         P        R        F
Radioligand binding (BAO_0002776)   100.00    96.00    97.96
Binding (BAO_0002989)                79.31    79.31    79.31
Protein activity (BAO_0013016)       80.00    84.85    82.35
in vivo method (BAO_0040021)        100.00    78.57    88.00
Cell phenotype (BAO_0002542)         79.31    95.83    86.79
Nucleic acid binding                100.00    91.18    95.38
Antimicrobial activity               84.62    84.62    84.62

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.99
Binding (BAO_0002989)                  0.98
Protein activity (BAO_0013016)         0.96
in vivo method (BAO_0040021)           0.97
Cell phenotype (BAO_0002542)           0.99
Nucleic acid binding                   1.00
Antimicrobial activity                 0.99

✔ Saved results to training/metrics.json
ℹ Running workflow 'al

chunk24

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------
  0       0          1.20           0.39       51.13    0.51
  1     200         90.91          58.86       57.87    0.58
  2     400         42.30          29.07       78.03    0.78
  5     600         33.06          19.95       89.83    0.90
  7     800         30.64          14.83       93.75    0.94
 10    1000         33.24          11.87       95.53    0.96
 14    1200         36.36           9.42       96.34    0.96
 19    1400         37.86           7.59       96.99    0.97
 24    1600         41.56           5.87       97.36    0.97
 31    1800         40.18           4.77       97.48    0.97
 40    2000         44.18           3.73       97.80    0.98
 51    2200         44.00           2.88       98.08    0.98
 64    2400         42.21           2.07       98.11    0.98
 77    2600         35.82           1.51       98.11    0.98
 90    2800         28.35           1.06       98.11    0.98
104    3000         22.00           0.78       98.20    0.98
117    3200         17.33           0.57       98.20    0.98
130    3400         15.44           0.43       98.18    0.98
144    3600          9.58           0.34       98.29    0.98
157    3800          8.69           0.29       98.23    0.98
170    4000          8.94           0.23       98.01    0.98
184    4200          5.94           0.18       98.03    0.98
197    4400          8.14           0.16       98.00    0.98
210    4600          6.38           0.13       98.02    0.98
224    4800          3.76           0.09       98.10    0.98
237    5000          4.36           0.07       98.06    0.98
250    5200          2.33           0.05       98.03    0.98
✔ Saved pipeline to output directory
training/model-last

ℹ Using CPU

TOK                   100.00
TEXTCAT (macro AUC)   98.29
SPEED                 27661

                                         P        R       F
Radioligand binding (BAO_0002776)   100.00    96.55   98.25
Binding (BAO_0002989)                81.82   100.00   90.00
Protein activity (BAO_0013016)       91.30    65.62   76.36
in vivo method (BAO_0040021)        100.00    85.71   92.31
Cell phenotype (BAO_0002542)         85.00    85.00   85.00
Nucleic acid binding                100.00    89.47   94.44
Antimicrobial activity               94.44    80.95   87.18

                                    ROC AUC
Radioligand binding (BAO_0002776)      0.98
Binding (BAO_0002989)                  0.99
Protein activity (BAO_0013016)         0.95
in vivo method (BAO_0040021)           1.00
Cell phenotype (BAO_0002542)           0.97
Nucleic acid binding                   1.00
Antimicrobial activity                 0.98

✔ Saved results to training/metrics.json
ℹ Running work

##### Cross-Validation metrics

In [22]:
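metrics_all = []

# Every entry in mpath that is not a CSV file is treated as one fold's output folder
for f in Path(mpath).iterdir():
    if not f.name.endswith('.csv'):
        chunk = f.name
        metricspath = os.path.join(f, 'training/metrics.json')
        # Read the per-category precision (p), recall (r) and F-score (f) for this fold
        with open(metricspath, 'r') as file:
            data = json.load(file)['cats_f_per_type']
        # One row per category, one column per metric
        df = pd.DataFrame.from_dict(data, orient='index')
        # Format scores as 2-decimal strings (DataFrame.map requires pandas >= 2.1)
        df = df.map(lambda x: format(x, ".2f"))
        df['fold'] = chunk
        df.reset_index(inplace=True)
        df.rename(columns={'index': 'category'}, inplace=True)
        metrics_all.append(df)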

In [23]:
pd.concat(metrics_all).sort_values(by='category').to_csv('./results/models_metrics.csv', index=False)
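
The exported table holds one row per category and fold, so a per-category summary across folds may be easier to read. The following is a minimal sketch, not part of the original pipeline: the helper name `summarise_folds` and the toy values are hypothetical, and it assumes the columns `category`, `p`, `r`, `f` and `fold` produced by the cell above.

```python
import pandas as pd


def summarise_folds(metrics: pd.DataFrame) -> pd.DataFrame:
    """Mean and sample standard deviation of p/r/f per category across folds.

    Hypothetical helper: assumes the long-format table built in the notebook.
    """
    scores = metrics.copy()
    # The notebook stores scores as formatted strings, so convert them back to floats
    for col in ["p", "r", "f"]:
        scores[col] = pd.to_numeric(scores[col])
    summary = scores.groupby("category")[["p", "r", "f"]].agg(["mean", "std"])
    # Flatten the two-level column index into names such as 'f_mean' and 'f_std'
    summary.columns = [f"{metric}_{stat}" for metric, stat in summary.columns]
    return summary.reset_index()


# Toy values for illustration only (not results from the notebook)
toy = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "p": ["0.80", "1.00", "0.50", "0.70"],
    "r": ["0.60", "0.80", "0.90", "0.70"],
    "f": ["0.70", "0.90", "0.60", "0.80"],
    "fold": ["chunk0", "chunk1", "chunk0", "chunk1"],
})
summary = summarise_folds(toy)
print(summary)
```

In the notebook this could be called as `summarise_folds(pd.concat(metrics_all))`. The standard deviation gives a rough view of how stable each category's score is across the repeated folds.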