# Chain ID standardization

## What is the goal of this notebook?

We want chains that share the same sequence to carry the same chain ID in every CIF file. To achieve this, we run a pairwise sequence alignment between each chain sequence and a reference sequence. The reference sequence can come from a given reference structure or from a consensus sequence calculated from the dataset. The ID reassignment works as follows:

Suppose chain A of test.cif has a 0.8 similarity score with chain B of the reference structure ref.cif, but only a 0.4 similarity score with chain A of ref.cif. Chain A of test.cif is then renamed to chain B, and a message is printed on screen:

>Original Chain ID | New Chain ID | Naive Similarity Score \
./test.cif \
A B 0.8

In the current version, we use a naive similarity score: the fraction of positions that are identical between the test and the reference sequences.
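As a rough illustration of this kind of score, here is a minimal sketch. It is not the PDBClean implementation: the function name `naive_similarity` is hypothetical, the inputs are assumed to be two already-aligned sequences of equal length with `-` as the gap character, and dividing by the alignment length is an assumption about the denominator.

```python
def naive_similarity(aligned_test: str, aligned_ref: str) -> float:
    """Fraction of alignment columns where both sequences carry the same residue.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_test) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_test:
        return 0.0
    # A column counts as a match only if the residues agree and are not gaps.
    matches = sum(
        1 for a, b in zip(aligned_test, aligned_ref)
        if a == b and a != "-"
    )
    return matches / len(aligned_test)


# Toy aligned sequences (hypothetical, not taken from the TIM dataset).
score_same = naive_similarity("MKVLAT", "MKVLAT")
score_gap = naive_similarity("MKVLAT--", "MKVLATGG")
print(score_same, score_gap)
```

Under these assumptions, two trailing residues that exist only in the reference lower the score even though every residue of the test sequence matches.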

Note that another chain in test.cif may already be labeled chain B. In that case, a prompt is printed on screen asking us to rename the original chain B, so make sure to keep track of the chain names. At the end of this process, we should have a complete dataset with consistent chain IDs across all structures.
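The reassignment rule and the name collision described above can be sketched as follows. This is only an illustration under stated assumptions: `plan_renames` and `find_collisions` are hypothetical helpers, not part of PDBClean, and the score table reuses the 0.8 and 0.4 values from the example above.

```python
def plan_renames(score_table):
    """Map each test chain to the reference chain with the highest score.

    score_table: {test_chain_id: {reference_chain_id: similarity_score}}
    """
    return {
        chain: max(scores, key=scores.get)
        for chain, scores in score_table.items()
    }


def find_collisions(plan, existing_ids):
    """Return (old_id, new_id) pairs whose new ID is already used in the file."""
    collisions = []
    for old_id, new_id in plan.items():
        if new_id != old_id and new_id in existing_ids:
            collisions.append((old_id, new_id))
    return collisions


# Chain A of test.cif scores 0.8 against reference chain B and 0.4 against A.
scores = {"A": {"A": 0.4, "B": 0.8}}
existing = {"A", "B"}  # chain IDs already present in test.cif (assumed)

plan = plan_renames(scores)
collisions = find_collisions(plan, existing)
print(plan, collisions)
```

In this toy case the plan renames chain A to B, and the collision list flags that an existing chain B would have to be renamed first.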

>**NOTE 1:** For this tutorial, we will not use the whole ensemble we downloaded. We will use a subsample of only 6 structures, and the next cells create the new directory. These 6 structures are taken from the ones we downloaded in Step 0, and were chosen to highlight some issues you may run into when running this script.

>**NOTE 2:** This is the continuation of step 2.2 

In [1]:
## First, import the library and set up the directories

In [2]:
from PDBClean import pdbclean_io

In [3]:
PROJDIR="./TIM"

In [4]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_ChainID_bank')

### How to run `PDBClean_ChainStandardization_CIF.py`

In the terminal, type:

> PDBClean_ChainStandardization_CIF.py `{Input Directory}` `{Output Directory}`

Select the following options when prompted by the on-screen menus:  

`1) Select Standard Sequences from input structure` -> 
`1) Show list of input structures` -> 
`2) Select input structure` -> 
`./TIM/standard_MolID_bank2/1aw1+01.cif` ->
`5) Return to main menu` ->
`4) Perform pairwise alignments against Standard Sequences` ->
`4) Perform pairwise alignments against Standard Sequences and create conversion template`

Wait for MUSCLE to finish running, then select:

`5) Perform Standardization of Chain IDs`

For this tutorial, you can also run the following cell, which already includes the inputs for these options. 

In [5]:
! echo '1\n1\n2\n./TIM/standard_MolID_bank2/1aw1+01.cif\n5\n4\n4\n5\n' | PDBClean_ChainStandardization_CIF.py $PROJDIR/standard_MolID_bank2 $PROJDIR/standard_ChainID_bank

Reading: ./TIM/standard_MolID_bank2/2y62+00.cif  (1 of 6)
Reading: ./TIM/standard_MolID_bank2/1ag1+00.cif  (2 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+04.cif  (3 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+02.cif  (4 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+03.cif  (5 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+01.cif  (6 of 6)
PDBClean ChainID Standardization Menu
    Select one of the following options to proceed:
    1) Select Standard Sequences from input structure
    2) Create Standard Sequences from consensus of input structures
Option Number:     Select Standard Sequences from input structure
    1) Show list of input structures
    2) Select input structure
Option Number: ./TIM/standard_MolID_bank2/2y62+00.cif
./TIM/standard_MolID_bank2/1ag1+00.cif
./TIM/standard_MolID_bank2/1aw1+04.cif
./TIM/standard_MolID_bank2/1aw1+02.cif
./TIM/standard_MolID_bank2/1aw1+03.cif
./TIM/standard_MolID_bank2/1aw1+01.cif
    Select Standard Sequences from input structure
    1) 


MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

Seq 2 seqs, max length 254, avg  length 254
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00      2 MB(0%)  Iter   1  100.00%  Align node       
00:00:00      2 MB(0%)  Iter   1  100.00%  Root alignment

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

Seq 2 seqs, max length 255, avg  length 255
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00      2 MB(0%)  Iter   1  100.00%  Align node       
00:00:00      2 MB(0%)  Iter   1  100.00%  Root alignment

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is dona

### Final comments on this example

TIM is a homodimer whose monomers are labeled A and B. It is therefore interesting that `2y62+00.cif` shows a chain ID reassignment. This is a consequence of our naive similarity score, which only counts the fraction of identical residues: because chain A and chain B of the reference do not have the same number of residues, the two scores differ.