**Protein Classification**

Install all required dependencies with ```pip```.

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
!pip install accelerate -U

You need the file ```protein_classification.json``` on your local machine (cf. the instructions in the README of the repository). Upload it to the Colab environment.


In [4]:
from google.colab import files
uploaded = files.upload()

Saving protein_classification.json to protein_classification.json


Let us write a helper function that loads the data from a JSON file into a pandas DataFrame.

In [3]:
import pandas as pd
def get_data(json_file_name):
  df=pd.read_json(json_file_name)
  return df

In [4]:
df = get_data('protein_classification.json')

In [5]:
df.head()

Unnamed: 0,Location,ID,Sequence,Test,Is_Cell_Membrane
0,Cell.membrane-M,Q9H400,MGLPVSWAPPALWVLGCCALLLSLWALCTACRRPEDAVAPRKRARR...,True,1
1,Cell.membrane-M,Q5I0E9,MEVLEEPAPGPGGADAAERRGLRRLLLSGFQEELRALLVLAGPAFL...,False,1
2,Cell.membrane-M,P63033,MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSSIVSRFLNGRFED...,False,1
3,Cell.membrane-M,Q9NR71,MAKRTFSNLETFLIFLLVMMSAITVALLSLLFITSGTIENHKDLGG...,False,1
4,Cell.membrane-M,Q86XT9,MGNCQAGHNLHLCLAHHPPLVCATLILLLLGLSGLGLGSFLLTHRT...,False,1


In [6]:
print(df['Location'].unique())

['Cell.membrane-M' 'Cytoplasm-Nucleus-U' 'Cytoplasm-S'
 'Endoplasmic.reticulum-M' 'Endoplasmic.reticulum-U'
 'Endoplasmic.reticulum-S' 'Golgi.apparatus-M' 'Golgi.apparatus-U'
 'Golgi.apparatus-S' 'Lysosome/Vacuole-M' 'Lysosome/Vacuole-U'
 'Lysosome/Vacuole-S' 'Mitochondrion-U' 'Mitochondrion-M'
 'Mitochondrion-S' 'Nucleus-U' 'Nucleus-M' 'Nucleus-S' 'Peroxisome-M'
 'Peroxisome-U' 'Peroxisome-S' 'Plastid-U' 'Plastid-S' 'Plastid-M'
 'Extracellular-S']
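
Each location string combines a compartment name with a one-letter suffix after the last hyphen. A minimal sketch for separating the two parts, assuming that every value follows this `compartment-suffix` pattern (the helper name `split_location` is hypothetical and not part of the notebook):

```python
def split_location(location):
    # Split at the last hyphen only, because compartment names
    # such as 'Cytoplasm-Nucleus' contain hyphens themselves.
    compartment, suffix = location.rsplit('-', 1)
    return compartment, suffix

print(split_location('Cell.membrane-M'))      # ('Cell.membrane', 'M')
print(split_location('Cytoplasm-Nucleus-U'))  # ('Cytoplasm-Nucleus', 'U')
```

The function could be applied to the whole column with `df['Location'].apply(split_location)`.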


The ```Sequence``` column contains the amino acid sequence of a protein, and the ```Location``` column contains its location in the cell. To simplify the task, we only ask whether the protein is located in the cell membrane or not, which is indicated by the column ```Is_Cell_Membrane```. To avoid long training times, we reduce the data set to 200 randomly sampled records per class.

In [7]:
df_is_cell_membrane = df[df['Is_Cell_Membrane'] == 1].sample(n=200, random_state=42)
df_is_not_cell_membrane = df[df['Is_Cell_Membrane'] == 0].sample(n=200, random_state=42)
df_union = pd.concat([df_is_cell_membrane, df_is_not_cell_membrane])
# Shuffle the records
df_union_shuffled = df_union.sample(frac=1, random_state=42).reset_index(drop=True)

In [8]:
df_union_shuffled.head()

Unnamed: 0,Location,ID,Sequence,Test,Is_Cell_Membrane
0,Extracellular-S,Q92626,MAKRSRGPGRRCLLALVLFCAWGTLAVVAQKPGAGCPSRCLCFRTT...,False,0
1,Lysosome/Vacuole-M,Q8VZR6,MTLTIPNAPGSSGYLDMFPERRMSYFGNSYILGLTVTAGIGGLLFG...,False,0
2,Cell.membrane-M,Q9HBA0-5,MADSSEGPRAGPGEVAELPGDESGTPGDGRPNLRMKFQGAFRKGVP...,False,1
3,Cytoplasm-S,A0JNT9,MSAFCLGLAGRASAPAEPDSACCMELPAGAGDAVRSPATAAALVSF...,False,0
4,Cell.membrane-M,Q9LQU4,MEAQHLHAKPHAEGEWSTGFCDCFSDCKNCCITFWCPCITFGQVAE...,True,1


Since we will use the Hugging Face ```Trainer``` class, the column indicating the class of a protein (membrane or not membrane) must be renamed to ```label```.

In [9]:
df_union_shuffled.rename(columns={'Is_Cell_Membrane': 'label'}, inplace=True)

This is a typical place where something can go wrong. The amino acids need to be separated by whitespace, since ProtBERT was trained on input in that format, cf. the [documentation](https://huggingface.co/Rostlab/prot_bert). The provided data set contains no whitespace, and the model would not be able to learn from the sequences as they are. We use a helper function to insert the whitespace.

In [10]:
def insert_space(column):
    # Apply the function to each element in the column
    column_with_space = column.apply(lambda x: ' '.join(list(x)))
    return column_with_space

In [11]:
df_union_shuffled['Sequence_with_space'] = insert_space(df_union_shuffled['Sequence'])

In [12]:
df_union_shuffled.head()

Unnamed: 0,Location,ID,Sequence,Test,label,Sequence_with_space
0,Extracellular-S,Q92626,MAKRSRGPGRRCLLALVLFCAWGTLAVVAQKPGAGCPSRCLCFRTT...,False,0,M A K R S R G P G R R C L L A L V L F C A W G ...
1,Lysosome/Vacuole-M,Q8VZR6,MTLTIPNAPGSSGYLDMFPERRMSYFGNSYILGLTVTAGIGGLLFG...,False,0,M T L T I P N A P G S S G Y L D M F P E R R M ...
2,Cell.membrane-M,Q9HBA0-5,MADSSEGPRAGPGEVAELPGDESGTPGDGRPNLRMKFQGAFRKGVP...,False,1,M A D S S E G P R A G P G E V A E L P G D E S ...
3,Cytoplasm-S,A0JNT9,MSAFCLGLAGRASAPAEPDSACCMELPAGAGDAVRSPATAAALVSF...,False,0,M S A F C L G L A G R A S A P A E P D S A C C ...
4,Cell.membrane-M,Q9LQU4,MEAQHLHAKPHAEGEWSTGFCDCFSDCKNCCITFWCPCITFGQVAE...,True,1,M E A Q H L H A K P H A E G E W S T G F C D C ...
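
The ProtBERT model card additionally maps the rare amino acids U, Z, O and B to X before tokenization. The notebook does not do this. The following is a sketch of how both steps could be combined, with `prepare_sequence` as a hypothetical helper name:

```python
import re

def prepare_sequence(sequence):
    # Replace the rare amino acids U, Z, O and B with the placeholder X
    sequence = re.sub(r"[UZOB]", "X", sequence)
    # Separate the residues by single spaces
    return " ".join(sequence)

print(prepare_sequence("MKUZ"))  # M K X X
```

It could be applied to the data with `df_union_shuffled['Sequence'].apply(prepare_sequence)`.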


Next, we split the data into training, validation, and test sets.

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split

def train_test_val_split(df):
  # Split the DataFrame into train and remaining data
  df_train, df_remaining = train_test_split(df, test_size=0.3, random_state=42)

  # Split the remaining data into test and validation sets
  df_test, df_val = train_test_split(df_remaining, test_size=0.5, random_state=42)

  # Print the shapes of the resulting DataFrames
  print("Train set shape:", df_train.shape)
  print("Test set shape:", df_test.shape)
  print("Validation set shape:", df_val.shape)
  return df_train, df_val, df_test

In [14]:
df_train, df_val, df_test = train_test_val_split(df_union_shuffled)

Train set shape: (280, 6)
Test set shape: (60, 6)
Validation set shape: (60, 6)
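
The split is random, so the share of membrane proteins can differ between the three sets. A small sketch for checking this on plain lists of labels (the helper `label_share` and the example labels are made up for illustration). If the shares drift apart, `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts of a split.

```python
from collections import Counter

def label_share(labels):
    # Fraction of records per class, e.g. {0: 0.5, 1: 0.5}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Made-up labels standing in for df_train['label'] and df_val['label']
train_labels = [0, 1, 1, 0, 1, 0, 0, 1]
val_labels = [1, 1, 1, 0]

print(label_share(train_labels))  # {0: 0.5, 1: 0.5}
print(label_share(val_labels))    # {0: 0.25, 1: 0.75}
```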


We want to use a pre-trained model that incorporates knowledge about proteins. It is based on a Transformer architecture and trained by masking parts of the input, comparable to pre-trained language models, cf. [ProteinBERT](https://academic.oup.com/bioinformatics/article/38/8/2102/6502274) for details. The tokenizer configuration matters more than in typical NLP tasks, since truncating the amino acid sequence can have a big impact.
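
To see how much truncation matters for a given data set, one can count the sequences that do not fit into the model input. The sketch below assumes one token per amino acid plus two special tokens (`[CLS]` and `[SEP]`) and a limit of 512 tokens, the `max_length` used for tokenization in this notebook. The helper name `truncation_report` and the example sequences are hypothetical.

```python
def truncation_report(sequences, max_length=512, num_special_tokens=2):
    # Number of residues that fit next to the special tokens
    max_residues = max_length - num_special_tokens
    too_long = [s for s in sequences if len(s) > max_residues]
    return {
        'max_residues': max_residues,
        'num_truncated': len(too_long),
        'share_truncated': len(too_long) / len(sequences),
    }

# Made-up sequences (without spaces) of length 100, 510, 511 and 1200
example = ['M' * 100, 'M' * 510, 'M' * 511, 'M' * 1200]
print(truncation_report(example))
# {'max_residues': 510, 'num_truncated': 2, 'share_truncated': 0.5}
```

With the notebook's data this could be called as `truncation_report(df_train['Sequence'])`, using the column without spaces.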

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer, TrainingArguments, Trainer
import torch
from datasets import Dataset
import pandas as pd
from sklearn.metrics import accuracy_score

# Load the pre-trained model and tokenizer
model_name = 'yarongef/DistilProtBert' #'Rostlab/prot_bert'
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = BertTokenizer.from_pretrained(model_name)

In the same way as for text classification, we need to tokenize the input (the amino acid sequence of the protein) and map the tokens to integer IDs (```input_ids```).

In [16]:
def get_dataset_encodings(df, sentence_key='Sequence_with_space'):
  # convert to Dataset type
  dataset_ = Dataset.from_pandas(df)

  def encode_records(record, tokenizer=tokenizer, sentence_key=sentence_key):
    #return tokenizer(record[sentence_key], truncation=True, padding=True)
    return tokenizer(record[sentence_key], truncation=True, max_length=512, padding='max_length', add_special_tokens=True, return_token_type_ids=False, return_attention_mask=True)

  # Tokenize and encode data
  dataset_encodings = dataset_.map(encode_records, batched=False)
  return dataset_encodings

In [17]:
dataset_train_encodings = get_dataset_encodings(df_train)
dataset_val_encodings = get_dataset_encodings(df_val)
dataset_test_encodings = get_dataset_encodings(df_test)

As for NLP tasks, we use the ```TrainingArguments``` class and the ```Trainer``` class as an abstraction of the training process.

In [19]:
training_args = TrainingArguments(
    output_dir='./results',  # Directory to save the model checkpoints
    num_train_epochs=10,
    #learning_rate = 0.1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.001, # 0.01
    logging_dir='./logs',  # Directory for storing logs
    logging_steps=1000,
    evaluation_strategy='epoch', #'epoch'
    save_strategy='no', #'epoch'
    metric_for_best_model='accuracy',
    seed=0
)
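
With the numbers used in this notebook, it is worth checking how many optimizer steps the training actually performs. A back-of-the-envelope sketch, assuming a single device and no gradient accumulation (the function name `training_steps` is hypothetical):

```python
import math

def training_steps(num_samples, batch_size, num_epochs):
    # One optimizer step per batch; the last batch may be smaller
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

steps_per_epoch, total_steps = training_steps(num_samples=280, batch_size=4, num_epochs=10)
print(steps_per_epoch, total_steps)  # 70 700

warmup_steps = 500
logging_steps = 1000
print(warmup_steps / total_steps)   # share of training spent in warmup
print(logging_steps > total_steps)  # True: the logging interval is never reached
```

Under these assumptions, the learning rate warms up for 500 of the 700 steps, and no intermediate training loss is logged because `logging_steps` exceeds the total number of steps. Smaller values for both may suit a data set of this size better.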

In [20]:
def compute_accuracy(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    accuracy = accuracy_score(labels,preds)
    return {'accuracy': accuracy}
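
`compute_accuracy` receives the raw logits of the model, one row per protein and one column per class, and picks the class with the larger logit. The same computation written out in plain Python, with made-up logits and labels (the name `accuracy_from_logits` is hypothetical):

```python
def accuracy_from_logits(logits, labels):
    # Predicted class = index of the largest logit in each row
    preds = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 0.4]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```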

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics = compute_accuracy,
    train_dataset=dataset_train_encodings,
    eval_dataset=dataset_val_encodings
)

In [None]:
trainer.train()



Epoch,Training Loss,Validation Loss



Why is the model not learning? Answer: The amino acids need to be separated by whitespace, since ProtBERT was trained on input in that format, cf. the [documentation](https://huggingface.co/Rostlab/prot_bert).

In [None]:
def insert_space(column):
    # Apply the function to each element in the column
    column_with_space = column.apply(lambda x: ' '.join(list(x)))
    return column_with_space
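
The effect can be illustrated with a toy tokenizer. This is a simplified stand-in, not the real `BertTokenizer`: it assumes a vocabulary with one entry per amino acid letter, splits the input at whitespace, and maps every word that is not in the vocabulary to `[UNK]`.

```python
# Toy vocabulary: the 20 standard amino acids plus X
VOCAB = set("ACDEFGHIKLMNPQRSTVWYX")

def toy_tokenize(text):
    # Split at whitespace and replace out-of-vocabulary words by [UNK]
    return [word if word in VOCAB else "[UNK]" for word in text.split()]

print(toy_tokenize("MGLPVS"))       # ['[UNK]']
print(toy_tokenize("M G L P V S"))  # ['M', 'G', 'L', 'P', 'V', 'S']
```

In this toy setting, a sequence without spaces collapses into a single unknown token, so all proteins look alike and there is nothing to learn from.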