<a href="https://colab.research.google.com/github/ahmedovich19/Machine-Learning-Projects/blob/master/BioAsq_question_answering_with_t5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --quiet transformers==4.1.1
!pip install --quiet pytorch-lightning==1.1.1
!pip install --quiet tokenizers==0.9.4
!pip install --quiet sentencepiece==0.1.94



In [None]:
import json
import pandas as pd 
import numpy as np
import torch
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from termcolor import colored
import textwrap

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5TokenizerFast as T5Tokenizer
)


In [None]:
pl.seed_everything(42)

42

In [None]:
%cd /content/drive/My\ Drive/

/content/drive/My Drive


In [None]:
!unzip -q QA.zip

In [None]:
with Path("BioASQ/BioASQ-train-factoid-4b.json").open() as json_file:
  data = json.load(json_file)

In [None]:
questions = data['data'][0]['paragraphs']

In [None]:
def extract_questions_and_answers(factoid_path: Path):
  with factoid_path.open() as json_file:
    data = json.load(json_file)

  paragraphs = data['data'][0]['paragraphs']

  data_rows = []

  # Each paragraph holds one context and a list of question/answer entries.
  for paragraph in paragraphs:
    context = paragraph['context']
    for question_and_answers in paragraph['qas']:
      question = question_and_answers['question']
      answers = question_and_answers['answers']
      for answer in answers:
        answer_text = answer['text']
        answer_start = answer['answer_start']
        # answer_end is exclusive: context[answer_start:answer_end] is the answer.
        answer_end = answer_start + len(answer_text)

        data_rows.append({
            'question': question,
            'context': context,
            'answer_text': answer_text,
            'answer_start': answer_start,
            'answer_end': answer_end
        })
  return pd.DataFrame(data_rows)

In [None]:
extract_questions_and_answers(Path("BioASQ/BioASQ-train-factoid-4b.json")).head()

Unnamed: 0,question,context,answer_text,answer_start,answer_end
0,What is the inheritance pattern of Li–Fraumeni...,Balanced t(11;15)(q23;q15) in a TP53+/+ breast...,autosomal dominant,213,231
1,What is the inheritance pattern of Li–Fraumeni...,Genetic modeling of Li-Fraumeni syndrome in ze...,autosomal dominant,105,123
2,Which type of lung cancer is afatinib used for?,Clinical perspective of afatinib in non-small ...,EGFR-mutant NSCLC,1203,1220
3,Which hormone abnormalities are characteristic...,"DOCA sensitive pendrin expression in kidney, h...",thyroid,419,426
4,Which hormone abnormalities are characteristic...,Clinical and molecular characteristics of Pend...,thyroid,705,712
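
The nesting that `extract_questions_and_answers` walks is easier to see on a tiny example. The sketch below builds a hand-made record (hypothetical, not taken from BioASQ) in the same SQuAD-style layout and flattens it with plain Python. `flatten_paragraphs` and the sample text are illustrative names and data, not part of the notebook.

```python
# Hypothetical, hand-made record in the SQuAD-style layout read above.
CONTEXT = "Aspirin irreversibly inhibits cyclooxygenase."
ANSWER = "cyclooxygenase"

sample = {
    "data": [{
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [{
                "question": "Which enzyme does aspirin inhibit?",
                "answers": [{"text": ANSWER,
                             "answer_start": CONTEXT.index(ANSWER)}],
            }],
        }]
    }]
}


def flatten_paragraphs(dataset):
    """Return one row per (question, context, answer) triple."""
    rows = []
    for paragraph in dataset["data"][0]["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                rows.append({
                    "question": qa["question"],
                    "context": context,
                    "answer_text": answer["text"],
                    "answer_start": start,
                    # Exclusive end offset, as in the notebook's function.
                    "answer_end": start + len(answer["text"]),
                })
    return rows


rows = flatten_paragraphs(sample)
print(rows[0]["question"], "->", rows[0]["answer_text"])
```

Passing such a list of dictionaries to `pd.DataFrame` would give the same columns as the table above.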


In [None]:
factoid_paths = sorted(list(Path("BioASQ/").glob("BioASQ-train-*")))
factoid_paths

[PosixPath('BioASQ/BioASQ-train-factoid-4b.json'),
 PosixPath('BioASQ/BioASQ-train-factoid-5b.json'),
 PosixPath('BioASQ/BioASQ-train-factoid-6b.json')]

In [None]:
dfs = []
for factoid_path in factoid_paths:
  dfs.append(extract_questions_and_answers(factoid_path))
df = pd.concat(dfs)

In [None]:
df

Unnamed: 0,question,context,answer_text,answer_start,answer_end
0,What is the inheritance pattern of Li–Fraumeni...,Balanced t(11;15)(q23;q15) in a TP53+/+ breast...,autosomal dominant,213,231
1,What is the inheritance pattern of Li–Fraumeni...,Genetic modeling of Li-Fraumeni syndrome in ze...,autosomal dominant,105,123
2,Which type of lung cancer is afatinib used for?,Clinical perspective of afatinib in non-small ...,EGFR-mutant NSCLC,1203,1220
3,Which hormone abnormalities are characteristic...,"DOCA sensitive pendrin expression in kidney, h...",thyroid,419,426
4,Which hormone abnormalities are characteristic...,Clinical and molecular characteristics of Pend...,thyroid,705,712
...,...,...,...,...,...
4767,What is the role of TAD protein domain?,Sequestration of p53 in the cytoplasm by adeno...,transactivation domain,765,787
4768,What is the role of TAD protein domain?,Leu628 of the KIX domain of CBP is a key resid...,transactivation domain,139,161
4769,What is the role of TAD protein domain?,Sequestration of p53 in the cytoplasm by adeno...,transactivation domain,765,787
4770,What is the role of TAD protein domain?,Essential roles of Da transactivation domains ...,transcription activation domain,401,432


In [None]:
df = df.drop_duplicates(subset=['context']).reset_index(drop=True)

In [None]:
df.shape

(2582, 5)

In [None]:
len(df.question.unique())

441

In [None]:
sample_question = df.iloc[240]
sample_question

question        What is the characteristic feature of the Dyke...
context         Left hemisphere and male sex dominance of cere...
answer_text                                  cerebral hemiatrophy
answer_start                                                  130
answer_end                                                    150
Name: 240, dtype: object

In [None]:
def color_answer(question):
  answer_start, answer_end = question['answer_start'], question['answer_end']
  context = question['context']

  # answer_end is exclusive (answer_start + len(answer_text)), so no "+ 1" is needed.
  return colored(context[:answer_start], 'white') + \
    colored(context[answer_start:answer_end], 'green') + \
    colored(context[answer_end:], 'white')

In [None]:
print(sample_question['question'])
print()
print("Answers")


for wrap in textwrap.wrap(color_answer(sample_question),width=100):
  print(wrap)

What is the characteristic feature of the Dyke-Davidoff-Masson syndrome.

Answers
Left hemisphere and male sex dominance of cerebral hemiatrophy (Dyke-Davidoff-Masson Syndrome).
Although radiological findings of cerebral hemiatrophy (Dyke-Davidoff-Masson
Syndrome) are well known, there is no systematic study about the gender and the affected side in
this syndrome. Brain images in 26 patients (mean aged 11) with cerebral hemiatrophy were
retrospectively reviewed. Nineteen patients (73.5%) were male and seven patients (26.5%) were
female. Left hemisphere involvement was seen in 18 patients (69.2%) and right hemisphere involvement
was seen in eight patients (30.8%). We conclude that male gender and left side involvement are
frequent in cerebral hemiatrophy disease.
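
Because `answer_end` is computed as `answer_start + len(answer_text)`, it is an exclusive offset, and the half-open slice `context[answer_start:answer_end]` should reproduce the answer exactly. A minimal sketch of that sanity check follows; `answer_span_matches` is a hypothetical helper and the context string is a shortened stand-in for the real abstract.

```python
def answer_span_matches(context, answer_text, answer_start):
    """Check that the recorded offset really points at the answer text."""
    answer_end = answer_start + len(answer_text)  # exclusive end offset
    return context[answer_start:answer_end] == answer_text


# Shortened, illustrative context (not the full abstract from the dataset).
context = "Radiological findings of cerebral hemiatrophy are well known."
answer = "cerebral hemiatrophy"
start = context.index(answer)

print(answer_span_matches(context, answer, start))      # offset is correct
print(answer_span_matches(context, answer, start + 1))  # offset is shifted by one
```

Applied row by row to `df`, a check like this could flag records whose offsets drifted during preprocessing.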


# Tokenization

In [None]:
MODEL_NAME = "t5-base"

In [None]:
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

In [None]:
encoding = tokenizer(
    sample_question['question'],
    sample_question['context'],
    max_length=396,
    padding="max_length",
    truncation="only_second",
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)

In [None]:
encoding.keys()

dict_keys(['input_ids', 'attention_mask'])

In [None]:
tokenizer.special_tokens_map

{'additional_special_tokens': "['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53>', '<extra_id_54>', '<extra_id_55>', '<extra_id_56>', '<extra_i

In [None]:
tokenizer.eos_token,tokenizer.eos_token_id

('</s>', 1)

In [None]:
tokenizer.decode(encoding['input_ids'].squeeze())

'What is the characteristic feature of the Dyke-Davidoff-Masson syndrome.</s> Left hemisphere and male sex dominance of cerebral hemiatrophy (Dyke-Davidoff-Masson Syndrome). Although radiological findings of cerebral hemiatrophy (Dyke-Davidoff-Masson Syndrome) are well known, there is no systematic study about the gender and the affected side in this syndrome. Brain images in 26 patients (mean aged 11) with cerebral hemiatrophy were retrospectively reviewed. Nineteen patients (73.5%) were male and seven patients (26.5%) were female. Left hemisphere involvement was seen in 18 patients (69.2%) and right hemisphere involvement was seen in eight patients (30.8%). We conclude that male gender and left side involvement are frequent in cerebral hemiatrophy disease.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa

In [None]:
answer_encoding = tokenizer(
    sample_question['answer_text'],
    max_length=32,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"

)

In [None]:
tokenizer.decode(answer_encoding['input_ids'].squeeze())

'cerebral hemiatrophy</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'

In [None]:
labels = answer_encoding['input_ids']
labels

tensor([[24387,     3,   107, 11658,    17, 29006,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])

In [None]:
labels[labels==0] = -100

In [None]:
labels

tensor([[24387,     3,   107, 11658,    17, 29006,     1,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100]])
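
Replacing the padding id with -100 matters because PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to -100, so padded positions do not contribute to the loss. The sketch below imitates that behaviour in plain Python on made-up per-token losses. `mask_padding` and `masked_mean` are hypothetical helpers, and the numbers are illustrative rather than real model outputs.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids with the index that the loss should ignore."""
    return [ignore_index if tok == pad_token_id else tok for tok in label_ids]


def masked_mean(per_token_losses, labels, ignore_index=-100):
    """Average only the losses whose label is not the ignore index."""
    kept = [loss for loss, label in zip(per_token_losses, labels)
            if label != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


# Token ids followed by the end-of-sequence id (1) and two pads (0).
label_ids = [24387, 3, 107, 1, 0, 0]
labels = mask_padding(label_ids)

# Made-up per-token losses; the large values sit on the padded positions.
per_token = [0.5, 1.5, 1.0, 1.0, 9.0, 9.0]

print(labels)
print(masked_mean(per_token, labels))
```

Without the masking step the two padded positions would dominate the average, which is why the notebook rewrites them before passing `labels` to the model.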

In [None]:
class BioQADataset(Dataset):
  def __init__(
      self,
      data: pd.DataFrame,
      tokenizer: T5Tokenizer,
      source_max_token_len: int = 256,
      target_max_token_len: int = 32,
      ):
    self.data = data
    self.tokenizer = tokenizer
    self.source_max_token_len = source_max_token_len
    self.target_max_token_len = target_max_token_len

  def __len__(self):
    return len(self.data)

  def __getitem__(self, index: int):
    data_row = self.data.iloc[index]

    # Use the tokenizer passed to the constructor, not the notebook-level global.
    source_encoding = self.tokenizer(
      data_row['question'],
      data_row['context'],
      max_length=self.source_max_token_len,
      padding='max_length',
      truncation="only_second",
      return_attention_mask=True,
      add_special_tokens=True,
      return_tensors="pt"
      )

    target_encoding = self.tokenizer(
      data_row['answer_text'],
      max_length=self.target_max_token_len,
      padding='max_length',
      truncation=True,
      return_attention_mask=True,
      add_special_tokens=True,
      return_tensors="pt"
      )

    # Padding positions (id 0 for T5) become -100 so the loss ignores them.
    labels = target_encoding['input_ids']
    labels[labels == 0] = -100

    return dict(
        question=data_row['question'],
        context=data_row['context'],
        answer_text=data_row['answer_text'],
        input_ids=source_encoding["input_ids"].flatten(),
        attention_mask=source_encoding['attention_mask'].flatten(),
        labels=labels.flatten()
    )




In [None]:
sample_dataset = BioQADataset(df, tokenizer)

In [None]:
for data in sample_dataset:
  print(data["question"])
  print(data['answer_text'])
  print(data['input_ids'][:10])
  print(data['labels'][:10])
  break

What is the inheritance pattern of Li–Fraumeni syndrome?
autosomal dominant
tensor([  363,    19,     8, 28915,  3275,    13,  1414,   104,   371,  6340])
tensor([ 1510, 10348,   138, 12613,     1,  -100,  -100,  -100,  -100,  -100])


In [None]:
train_df,val_df = train_test_split(df,test_size=0.05)

In [None]:
train_df.shape, val_df.shape

((12338, 5), (650, 5))

In [None]:
class BioDataModule(pl.LightningDataModule):
  def __init__(
      self,
      train_df: pd.DataFrame,
      test_df: pd.DataFrame,
      tokenizer: T5Tokenizer,
      batch_size: int = 8,
      source_max_token_len: int = 256,
      target_max_token_len: int = 32,
      ):
    super().__init__()
    self.train_df = train_df
    self.test_df = test_df
    self.tokenizer = tokenizer
    self.batch_size = batch_size
    self.source_max_token_len = source_max_token_len
    self.target_max_token_len = target_max_token_len

  # Lightning may pass a stage ("fit" or "test"), so accept it even though it is unused.
  def setup(self, stage=None):
    self.train_dataset = BioQADataset(
        self.train_df,
        self.tokenizer,
        self.source_max_token_len,
        self.target_max_token_len
        )

    self.test_dataset = BioQADataset(
        self.test_df,
        self.tokenizer,
        self.source_max_token_len,
        self.target_max_token_len
        )

  def train_dataloader(self):
    return DataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        shuffle=True,
        num_workers=4
        )

  # Validation and test both read the held-out split.
  def val_dataloader(self):
    return DataLoader(
        self.test_dataset,
        batch_size=self.batch_size,
        num_workers=4
        )

  def test_dataloader(self):
    return DataLoader(
        self.test_dataset,
        batch_size=1,
        num_workers=4
        )

In [None]:
BATCH_SIZE = 16
N_EPOCHS = 6

data_module = BioDataModule(train_df, val_df, tokenizer, batch_size=BATCH_SIZE)
data_module.setup()

In [None]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

Some weights of the model checkpoint at t5-base were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
output = model(
    input_ids=encoding['input_ids'],
    attention_mask=encoding['attention_mask'],
    labels=labels
)

In [None]:
output.logits.shape

In [None]:
output.loss

In [None]:
class BioQAModel(pl.LightningModule):
  def __init__(self):
    super().__init__()
    self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)


  def forward(self, input_ids, attention_mask, labels=None):
    output = self.model(
        input_ids, 
        attention_mask=attention_mask,
        labels=labels)

    return output.loss, output.logits

  def training_step(self, batch, batch_idx):
    input_ids = batch['input_ids']
    attention_mask=batch['attention_mask']
    labels = batch['labels']
    loss, outputs = self(input_ids, attention_mask, labels)
    self.log("train_loss", loss, prog_bar=True, logger=True)
    return {"loss": loss, "predictions":outputs, "labels": labels}

  def validation_step(self, batch, batch_idx):
    input_ids = batch['input_ids']
    attention_mask=batch['attention_mask']
    labels = batch['labels']
    loss, outputs = self(input_ids, attention_mask, labels)
    self.log("val_loss", loss, prog_bar=True, logger=True)
    return loss

  def test_step(self, batch, batch_idx):
    input_ids = batch['input_ids']
    attention_mask=batch['attention_mask']
    labels = batch['labels']
    loss, outputs = self(input_ids, attention_mask, labels)
    self.log("test_loss", loss, prog_bar=True, logger=True)
    return loss

  def configure_optimizers(self):

    optimizer = AdamW(self.parameters(), lr=0.0001)
    return optimizer


In [None]:
model = BioQAModel()


In [None]:
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    filename="best-checkpoint",
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min"
)

In [None]:
# Log to the same directory that the TensorBoard cell reads (./lightning_logs).
logger = pl.loggers.TensorBoardLogger('lightning_logs', name='bio-qa')
trainer = pl.Trainer(
    logger = logger,
    checkpoint_callback=checkpoint_callback,
    max_epochs=N_EPOCHS,
    gpus=1,
    progress_bar_refresh_rate = 30
)

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir ./lightning_logs

In [None]:
!rm -rf lightning_logs
!rm -rf checkpoints

In [None]:
trainer.fit(model,data_module)


  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Epoch 0, global step 306: val_loss reached 0.30702 (best 0.30702), saving model to "/content/drive/My Drive/checkpoints/best-checkpoint.ckpt" as top 1


Epoch 1, global step 613: val_loss reached 0.24178 (best 0.24178), saving model to "/content/drive/My Drive/checkpoints/best-checkpoint.ckpt" as top 1


Epoch 2, global step 920: val_loss reached 0.21556 (best 0.21556), saving model to "/content/drive/My Drive/checkpoints/best-checkpoint.ckpt" as top 1


Epoch 3, global step 1227: val_loss reached 0.20418 (best 0.20418), saving model to "/content/drive/My Drive/checkpoints/best-checkpoint.ckpt" as top 1


Epoch 4, step 1534: val_loss was not in top 1


Epoch 5, global step 1841: val_loss reached 0.20347 (best 0.20347), saving model to "/content/drive/My Drive/checkpoints/best-checkpoint.ckpt" as top 1





1

In [None]:
trainer.test()

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss': 0.22696293890476227}
--------------------------------------------------------------------------------


[{'test_loss': 0.22696293890476227}]

In [None]:
# The training log above shows the checkpoint being saved as checkpoints/best-checkpoint.ckpt.
trained_model = BioQAModel.load_from_checkpoint("checkpoints/best-checkpoint.ckpt")
trained_model.freeze()

In [None]:
def generate_answer(question):
  source_encoding = tokenizer(
      question['question'],
      question['context'],
      max_length = 396,
      padding='max_length',
      truncation='only_second',
      return_attention_mask=True,
      add_special_tokens=True,
      return_tensors='pt'
  )

  generated_ids = trained_model.model.generate(
      input_ids = source_encoding['input_ids'],
      attention_mask=source_encoding['attention_mask'],
      num_beams=1,
      max_length=80,
      repetition_penalty=2.5,
      length_penalty=1.0,
      early_stopping=True,
      use_cache=True

  )

  preds = [
          tokenizer.decode(generated_id, skip_special_tokens=True,clean_up_tokenization_spaces=True)
          for generated_id in generated_ids
  ]

  return ''.join(preds)

In [None]:
sample_question = val_df.iloc[3]
sample_question

question          The small molecule SEA0400 is an inhibitor of which ion antiporter/exchanger?
context         SEA0400, a specific inhibitor of the Na+-Ca2+ exchanger, attenuates sodium n...
answer_text                                                                                 NCX
answer_start                                                                                212
answer_end                                                                                  215
question_emb    [0.3978716373443604, -0.04826507568359375, -0.5220368862152099, 0.6058694839...
Name: 318, dtype: object

In [None]:
sample_question['question']

'The small molecule SEA0400 is an inhibitor of which ion antiporter/exchanger?'

In [None]:
sample_question['answer_text']

'NCX'

In [None]:
sample_question['context']

'SEA0400, a specific inhibitor of the Na+-Ca2+ exchanger, attenuates sodium nitroprusside-induced apoptosis in cultured rat microglia. 1. Using SEA0400, a potent and selective inhibitor of the Na+-Ca2+ exchanger (NCX), we examined whether NCX is involved in nitric oxide (NO)-induced disturbance of endoplasmic reticulum (ER) Ca2+ homeostasis followed by apoptosis in cultured rat microglia. 2. Sodium nitroprusside (SNP), an NO donor, decreased cell viability in a dose- and time-dependent manner with apoptotic cell death in cultured microglia. 3. Treatment with SNP decreased the ER Ca2+ levels as evaluated by measuring the increase in cytosolic Ca2+ level induced by exposing cells to thapsigargin, an irreversible inhibitor of ER Ca2+-ATPase. 4. The treatment with SNP also increased mRNA expression of CHOP and GPR78, makers of ER stress. 5. SEA0400 at 0.3-1.0 microM protected microglia against SNP-induced apoptosis. 6. SEA0400 blocked not only the SNP-induced decrease in ER Ca2+ levels but

In [None]:
generate_answer(sample_question)

'NCX'
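
The notebook reports only the test loss, which says little about how often the generated string equals the reference answer. One possible complement is an exact-match score over `(prediction, answer_text)` pairs, sketched below under the assumption that light normalisation (lowercasing, stripping punctuation and extra whitespace) is acceptable. `normalize` and `exact_match` are hypothetical helpers, and the example strings are hand-written, not model outputs.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after normalisation."""
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


# Hand-written illustrative pairs.
predictions = ["NCX", "Thyroid.", "autosomal recessive"]
references = ["NCX", "thyroid", "autosomal dominant"]

print(exact_match(predictions, references))
```

In the notebook, the predictions could come from calling `generate_answer` on each row of `val_df` and the references from its `answer_text` column.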

In [None]:
xx=val_df.index[val_df["question"] == 'Which protein is the main marker of Cajal bodies?']
xx[0]

2495

In [None]:
val_df.loc[xx[0]]['context']

'Substrate profiling of human vaccinia-related kinases identifies coilin, a Cajal body nuclear protein, as a phosphorylation target with neurological implications. Protein phosphorylation by kinases plays a central role in the regulation and coordination of multiple biological processes. In general, knowledge on kinase specificity is restricted to substrates identified in the context of specific cellular responses, but kinases are likely to have multiple additional substrates and be integrated in signaling networks that might be spatially and temporally different, and in which protein complexes and subcellular localization can play an important role. In this report the substrate specificity of atypical human vaccinia-related kinases (VRK1 and VRK2) using a human peptide-array containing 1080 sequences phosphorylated in known signaling pathways has been studied. The two kinases identify a subset of potential peptide targets, all of them result in a consensus sequence composed of at leas