## Broad category model for assay descriptions
This notebook runs the main pipeline for training the model on 100% of the data.

In [7]:
import pandas as pd
from pathlib import Path
import os
import json
import spacy
from sklearn.model_selection import RepeatedKFold
import shutil
import subprocess

### Setup

In [8]:
# Set path to python in the environment to use for model training
env_path = "/hps/software/users/chembl/ines/assays_description/bin/python"

In [9]:
#Settings for display (if needed)
pd.set_option('display.max_colwidth', None)  # Set to None to display the full column width
pd.set_option('display.max_rows', None)      # Set to None to display all rows

### Model training: Binding assays and functional assays

#### Clean up dataset

In [10]:
dataset = pd.read_csv('data/2_broad_category_training_data.csv')

In [11]:
dataset.head()

Unnamed: 0,assay_id,assay_type,description,label,bao_preferred_term,bao_id
0,868,B,"Inhibition of [3H]8-hydroxy-2-dipropylamino-1,2,3,4-tetrahydronaphthalene binding to 5-hydroxytryptamine 1A receptor in hippocampus region of rat brain; Residual radioligand binding higher than 50%","Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
1,2027,B,Displacement of [3H]-5-HT from human 5-hydroxytryptamine 1D receptor beta,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
2,2430,B,Inhibition constant for in vitro inhibition of [3H]ketanserin binding to rat frontal cortex membranes 5-hydroxytryptamine 2A receptor,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
3,3306,B,Compound was evaluated for the binding affinity against human cloned 5-hydroxytryptamine 4 receptor in HeLa cells using [3H]-LSD as the radioligand,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776
4,3703,B,In vitro binding affinity by radioligand binding assay using cell line expressing human 5-hydroxytryptamine 7 receptor; ND means not determined,"Radioligand competition, displacement, binding",radioligand binding assay,BAO_0002776


In [12]:
dataset.value_counts(subset='label')

label
Nucleic acid binding                              176
Protein activity                                  171
Binding affinity, displacement, competition       134
Radioligand competition, displacement, binding    118
Cell phenotype                                    113
Antimicrobial activity                             88
in vivo method                                     63
Name: count, dtype: int64

In [7]:
# Can only train the large categories
chosen_categories = [
    'Nucleic acid binding'
    , 'Protein activity'
    , 'Binding affinity, displacement, competition'
    , 'Radioligand competition, displacement, binding'
    , 'Cell phenotype'
    , 'Antimicrobial activity'
    , 'in vivo method'
]

In [8]:
processed_df = dataset.loc[dataset['label'].isin(chosen_categories)]

In [9]:
value_dict = {
    'Radioligand competition, displacement, binding': {'Radioligand binding (BAO_0002776)': 1.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Binding affinity, displacement, competition': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 1.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Protein activity': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 1.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'in vivo method': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 1.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Cell phenotype': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 1.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 0.0}
    , 'Nucleic acid binding': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 1.0, 'Antimicrobial activity': 0.0}
    , 'Antimicrobial activity': {'Radioligand binding (BAO_0002776)': 0.0, 'Binding (BAO_0002989)': 0.0, 'Protein activity (BAO_0013016)': 0.0, 'in vivo method (BAO_0040021)': 0.0, 'Cell phenotype (BAO_0002542)': 0.0, 'Nucleic acid binding': 0.0, 'Antimicrobial activity': 1.0}

}
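
The one-hot mapping above could also be generated rather than typed out, which keeps the seven category keys consistent across rows. A minimal sketch, assuming the label-to-category pairs listed in `value_dict`; the names `label_to_cat` and `build_one_hot` are illustrative and not part of the original notebook:

```python
# Map each dataset label to the category key used in the `cats` dictionary
# (pairs copied from value_dict above).
label_to_cat = {
    'Radioligand competition, displacement, binding': 'Radioligand binding (BAO_0002776)',
    'Binding affinity, displacement, competition': 'Binding (BAO_0002989)',
    'Protein activity': 'Protein activity (BAO_0013016)',
    'in vivo method': 'in vivo method (BAO_0040021)',
    'Cell phenotype': 'Cell phenotype (BAO_0002542)',
    'Nucleic acid binding': 'Nucleic acid binding',
    'Antimicrobial activity': 'Antimicrobial activity',
}

def build_one_hot(label_to_cat):
    """Return {label: {category: 1.0 for the label's own category, else 0.0}}."""
    cats = list(label_to_cat.values())
    return {
        label: {cat: float(cat == target) for cat in cats}
        for label, target in label_to_cat.items()
    }

# Hypothetical stand-in for the hand-written value_dict
generated = build_one_hot(label_to_cat)
```

Under these assumptions `generated` should compare equal to `value_dict`, so adding or renaming a category only means editing one line of `label_to_cat`.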

In [10]:
# Build the column with assign() so a new DataFrame is returned instead of
# writing into a slice of `dataset` (avoids pandas' SettingWithCopyWarning)
processed_df = processed_df.assign(
    cats=processed_df['label'].apply(lambda x: value_dict[x])
)

In [11]:
# rename() without inplace=True returns a new DataFrame for the same reason
processed_df = processed_df.rename(columns={'description': 'text'})


In [12]:
processed_df = processed_df[['text', 'cats']]

#### Save training file

In [13]:
mpath = "Model"
dirpath = Path(mpath)
if dirpath.exists():
    shutil.rmtree(dirpath)
os.makedirs(mpath)

# Write to JSONL files
with open(os.path.join(mpath,'assays_train.jsonl'), 'w') as f:
    f.write(processed_df.to_json(orient='records', lines=True))
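
Before handing the file to the training pipeline, it can help to confirm that every line parses as JSON and carries exactly one positive category, which is what the one-hot `cats` column in this notebook implies. A rough sketch using only the standard library; the helper name `check_textcat_jsonl` and the sample record are made up for illustration:

```python
import json
import os
import tempfile

def check_textcat_jsonl(path):
    """Return the number of records, raising ValueError on a malformed one."""
    n = 0
    with open(path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            cats = record.get('cats')
            if not isinstance(record.get('text'), str) or not isinstance(cats, dict):
                raise ValueError(f'line {line_no}: expected "text" and "cats" fields')
            if sum(1 for v in cats.values() if v == 1.0) != 1:
                raise ValueError(f'line {line_no}: expected exactly one positive category')
            n += 1
    return n

# Demonstration on a throwaway file with one invented record
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write(json.dumps({'text': 'example assay description',
                            'cats': {'A': 1.0, 'B': 0.0}}) + '\n')
    n_records = check_textcat_jsonl(demo_path)
```

In the notebook this would be called as `check_textcat_jsonl(os.path.join(mpath, 'assays_train.jsonl'))`.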

#### Do training

In [14]:
pwd

'/homes/ines/repos/assays_description/2_broad_assay_category'

In [15]:
# Set up the path to the training data, relative to the pipeline folder
train = os.path.join('../', mpath, 'assays_train.jsonl')

# Change into the pipeline template folder
os.chdir('textcat_broad_categories')

# exist_ok=True avoids a FileExistsError if a previous run was interrupted
os.makedirs('assets', exist_ok=True)
os.makedirs('training', exist_ok=True)

# Copy the current input file to the pipeline's assets folder
shutil.copy(train, 'assets')

# Run the pipeline; check=True raises if the workflow exits with an error
command = [env_path, '-m', 'weasel', 'run', 'final']
subprocess.run(command, check=True)

# Copy outputs to the model folder
opath = os.path.join('../', mpath, 'training')
shutil.copytree('training', opath, dirs_exist_ok=True)

# Remove directories to start clean
shutil.rmtree('assets')
shutil.rmtree('training')
shutil.rmtree('corpus')
os.remove('project.lock')

os.chdir('../')

ℹ Saving to output directory: training
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'textcat_multilabel']
ℹ Initial learn rate: 0.0
E    #       LOSS TOK2VEC  LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  ------------  -------------  ----------  ------
  0       0          0.58           0.43       52.96    0.53
  1     200         92.62          60.48       61.01    0.61
  2     400         47.02          29.76       79.00    0.79
  4     600         31.08          21.00       89.29    0.89
  6     800         31.99          15.39       93.65    0.94
  8    1000         34.01          12.31       95.90    0.96
 11    1200         37.30           9.70       97.12    0.97
 15    1400         35.54           8.11       97.93    0.98
 20    1600         42.54           6.54       98.62    0.99
 26    1800         44.57           5.26       98.85    0.99
 33    2000         43.40           4.12       99.33    0.99
 41    2200         48.43           3.17       99.48    0.99
 52    2400         49.04           2.44       99.53    1.00
 62    2600         41.28           1.88       99.60    1.00
 73    2800         31.80           1.49       99.62    1.00
 83    3000         25.34           1.10       99.65    1.00
 94    3200         25.31           0.96       99.69    1.00
104    3400         20.44           0.70       99.77    1.00
115    3600         14.80           0.55       99.76    1.00
125    3800         15.27           0.47       99.79    1.00
136    4000         11.09           0.37       99.83    1.00
146    4200         10.46           0.36       99.88    1.00
157    4400          9.07           0.28       99.90    1.00
167    4600          9.23           0.30       99.92    1.00
178    4800          4.35           0.23       99.94    1.00
188    5000          5.90           0.20       99.97    1.00
199    5200          3.68           0.18       99.98    1.00
209    5400          5.60           0.16       99.98    1.00
220    5600          2.65           0.16       99.99    1.00
231    5800          4.87           0.14      100.00    1.00
241    6000          3.38           0.12      100.00    1.00
252    6200          2.48           0.09      100.00    1.00
262    6400          3.65           0.09      100.00    1.00
273    6600          4.23           0.06      100.00    1.00
283    6800          2.68           0.05      100.00    1.00
294    7000          1.79           0.03      100.00    1.00
304    7200          1.34           0.04      100.00    1.00
315    7400          2.11           0.03      100.00    1.00
325    7600          3.97           0.04      100.00    1.00
336    7800          1.72           0.02      100.00    1.00
✔ Saved pipeline to output directory
training/model-last

ℹ Running workflow 'final'
Running command: /hps/software/users/chembl/ines/assays_description/bin/python scripts/convert.py en assets/assays_train.jsonl corpus/train.spacy
Running command: /hps/software/users/chembl/ines/assays_description/bin/python -m spacy train configs/config.cfg --output training/ --paths.train corpus/train.spacy --nlp.lang en --gpu-id -1
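
The training cell changes the working directory and only restores it on its last line, so a failed pipeline run would leave the notebook inside `textcat_broad_categories`. One possible way around this, sketched with the standard library only (the helper name `working_directory` is hypothetical; Python 3.11+ also ships a similar `contextlib.chdir`):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, restoring the previous directory on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the notebook is never left stranded
        os.chdir(previous)

# Demonstration with a temporary folder standing in for the pipeline directory
start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    try:
        with working_directory(tmp):
            raise RuntimeError('simulated pipeline failure')
    except RuntimeError:
        pass
    restored = os.getcwd()
```

With such a helper, the body of the training cell could sit inside `with working_directory('textcat_broad_categories'):` and the final `os.chdir('../')` would no longer be needed.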
