# Chain ID standardization

## What is the goal of this notebook?

In some cases, entities can be mislabeled, and we want to make sure that chains with the same sequence in different cifs are all named the same way. To make that happen, we do a pairwise sequence alignment between each sequence and a reference sequence. The reference sequences can come from a structure provided by the user, or from representative sequences calculated from the dataset provided by the user. In the current version, we use a naive similarity score: the fraction of positions that are identical between the test and the reference sequences. We then use the Hospital-Resident matching algorithm to assign the chain IDs.
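
To make the assignment step more concrete, here is a minimal sketch of a matching driven by similarity scores. It is not the code used by `PDBClean_ChainStandardization_CIF.py`: it assumes the one-to-one special case of Hospital-Resident matching (each reference chain ID accepts a single test chain), and the function name `match_chains` and the nested-dictionary layout of `scores` are hypothetical.

```python
def match_chains(scores):
    """Assign each test chain to a reference chain ID.

    scores[test_chain][ref_chain] is the similarity between the two.
    Assumption: every reference chain ID has capacity one.
    Returns a dict {test_chain: ref_chain}.
    """
    # Each test chain ranks the reference chains from most to least similar.
    prefs = {t: sorted(refs, key=refs.get, reverse=True) for t, refs in scores.items()}
    next_choice = {t: 0 for t in scores}  # index of the next reference chain to try
    held = {}                             # ref_chain -> test chain currently holding it
    free = list(scores)                   # test chains still looking for a chain ID

    while free:
        t = free.pop()
        if next_choice[t] >= len(prefs[t]):
            continue  # no reference chain left to try: this chain stays unassigned
        r = prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = held.get(r)
        if current is None:
            held[r] = t                   # reference chain ID was free
        elif scores[t][r] > scores[current][r]:
            held[r] = t                   # the newcomer is more similar: replace the holder
            free.append(current)
        else:
            free.append(t)                # rejected: try the next preference later

    return {t: r for r, t in held.items()}


# Illustrative scores only, chosen to mirror a swap of chain IDs.
example_scores = {
    "A": {"A": 0.7, "B": 0.8},
    "B": {"A": 0.7, "B": 0.6},
}
print(match_chains(example_scores))
```

With these illustrative scores, chain A is most similar to reference B and chain B to reference A, so the two labels are swapped.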

Besides the new cifs, a record file (`ChainStandardizationRecord.txt`) is also generated. Each line gives, separated by colons, the file name, the original chain ID, the new chain ID, and the naive similarity score:

> File Name |Original Chain ID | New Chain ID | Naive Similarity Score \
\
./test.cif:A:B:0.8 \
./test.cif:B:A:0.7

It is a good idea to look at these scores, as this may help identify possible outliers in the dataset. 
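
One way to review the scores is to parse the record lines and list the chains that were renamed or that scored low. The sketch below assumes the colon-separated layout shown above. The names `Record`, `parse_record_line`, and `flag_records`, and the threshold of 0.5, are hypothetical choices for illustration, not part of PDBClean.

```python
from typing import NamedTuple


class Record(NamedTuple):
    path: str
    old_chain: str
    new_chain: str
    score: float


def parse_record_line(line):
    """Parse one 'file:old_chain:new_chain:score' line."""
    # Split from the right so a colon inside the file path does not break parsing.
    path, old_chain, new_chain, score = line.strip().rsplit(":", 3)
    return Record(path, old_chain, new_chain, float(score))


def flag_records(lines, threshold=0.5):
    """Return records that were renamed or scored below `threshold`.

    The default threshold is an arbitrary value for this sketch.
    """
    records = [parse_record_line(line) for line in lines if line.strip()]
    return [r for r in records if r.old_chain != r.new_chain or r.score < threshold]


example = ["./test.cif:A:B:0.8", "./test.cif:B:A:0.7"]
for rec in flag_records(example):
    print(rec.path, rec.old_chain, "->", rec.new_chain, rec.score)
```

To apply it to a real run, pass the lines read from the record file instead of `example`.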


>**NOTE 1:** For this tutorial, we will not use the whole ensemble we downloaded. We will use a subsample of only 6 structures, chosen from the ones we downloaded in Step 0 to highlight some issues you may run into when running this script. The next cells will create the new directory.

>**NOTE 2:** This is the continuation of step 2.2 

In [1]:
## First, import the library and set up the directories

In [2]:
from PDBClean import pdbclean_io

In [3]:
PROJDIR="./TIM"

In [4]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_ChainID_bank')

### How to run `PDBClean_ChainStandardization_CIF.py`

In the terminal, type:

> PDBClean_ChainStandardization_CIF.py `{Input Directory}` `{Output Directory}`

Select the following choices when prompted in the on-screen menus:  

>`2) Generate Standard Sequences based on all the input structures` -> 
`4) Generate Standard Sequences based on all input structures` -> `Wait for menu to appear` ->
`4) Perform Standardization of Chain IDs` ->
`4) Perform pairwise alignments against Standard Sequences and rename chains (if needed)`

For this tutorial, you can also run the following cell, which already includes the inputs for these options. 

In [5]:
! echo '2\n4\n4\n4\n' | PDBClean_ChainStandardization_CIF.py $PROJDIR/standard_MolID_bank2 $PROJDIR/standard_ChainID_bank

Reading: ./TIM/standard_MolID_bank2/2y62+00.cif  (1 of 6)
Reading: ./TIM/standard_MolID_bank2/1ag1+00.cif  (2 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+04.cif  (3 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+02.cif  (4 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+03.cif  (5 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+01.cif  (6 of 6)
PDBClean ChainID Standardization Menu
    Select one of the following options to proceed:
    1) Select Standard Sequences from a chosen input structure
    2) Generate Standard Sequences based on all the input structures
Option Number:     Generate Standard Sequences based on all the input structures.
    Type QUIT to return to the main menu.
    1) Show list of chain IDs for Standard Sequences
    2) Enter chain IDs to remove from list
    3) Input file with list of chain IDs to remove
    4) Generate Standard Sequences based on all input structures
    5) Load previously generated standard sequences
Option Number: A
this_chansseq_set
{'SKPQP


muscle 5.1.osx64 []  17.2Gb RAM, 8 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 252, max 255

00:00 2.1Mb  CPU has 8 cores, running 8 threads
00:00 2.3Mb   100.0% Calc posteriors
00:00 5.1Mb   100.0% UPGMA5         
Chains already completed:
['A']
Now I am working on this chain:
B

muscle 5.1.osx64 []  17.2Gb RAM, 8 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 252, max 254

00:00 2.1Mb  CPU has 8 cores, running 8 threads
00:00 2.3Mb   100.0% Calc posteriors
00:00 5.1Mb   100.0% UPGMA5         
Chains already completed:
['A', 'B']
Just finished with this structure:
['./TIM/standard_MolID_bank2/2y62+00.cif']
These are the results:
{'A': 'B'}
{'A': '0.38132295719844356'}
structid_list
['./TIM/standard_MolID_bank2/2y62+00.cif']
my file name
./TIM/standard_MolID_bank2/2y62+00.cif
Original Chain ID | New Chain ID | Naive Similarity Score 


Chains already completed:
['A']
Now I am working on this chain:
B

muscle 5.1.osx64 []  17.2Gb RAM, 8 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 254, max 254

00:00 2.0Mb  CPU has 8 cores, running 8 threads
00:00 2.3Mb   100.0% Calc posteriors
00:00 5.1Mb   100.0% UPGMA5         
Chains already completed:
['A', 'B']
Just finished with this structure:
['./TIM/standard_MolID_bank2/1aw1+03.cif']
These are the results:
{'A': 'A', 'B': 'B'}
{'A': '1.0', 'B': '1.0'}
structid_list
['./TIM/standard_MolID_bank2/1aw1+03.cif']
my file name
./TIM/standard_MolID_bank2/1aw1+03.cif
Original Chain ID | New Chain ID | Naive Similarity Score 
./TIM/standard_MolID_bank2/1aw1+03.cif
A A 1.0
B B 1.0
new_pdb_out
./TIM/standard_ChainID_bank/1aw1+03.cif
I am starting to work on:
['./TIM/standard_MolID_bank2/1aw1+01.cif']
this is chid_seq_map and length
{'A': 'RHPVVMGNWKLNGSKEMVVDLLNGLNAELEGVTGVDVAVAPPALFVDLAERTLTEAGSAIILGAQNTDLNNSG
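
The piped command in the cell above can also be driven from Python, which avoids relying on how the shell's `echo` treats `\n`. The sketch below uses only the standard `subprocess` module and the same command and menu answers as that cell. The helper names `build_menu_input` and `run_chain_standardization` are hypothetical, and how the script behaves once its input is exhausted is not verified here.

```python
import subprocess


def build_menu_input(choices):
    """Join menu choices into the text fed to the script's standard input."""
    # One choice per line, plus the extra newline that `echo` appends after
    # the string, so the input mirrors the notebook cell.
    return "".join(f"{choice}\n" for choice in choices) + "\n"


def run_chain_standardization(input_dir, output_dir, choices=("2", "4", "4", "4")):
    """Run the script with pre-typed menu answers and return the CompletedProcess."""
    cmd = ["PDBClean_ChainStandardization_CIF.py", input_dir, output_dir]
    return subprocess.run(cmd, input=build_menu_input(choices), text=True)
```

For this tutorial, the call would be `run_chain_standardization("./TIM/standard_MolID_bank2", "./TIM/standard_ChainID_bank")`. Check the `returncode` of the result before using the output directory.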

### Visualize results 

In [6]:
! cat $PROJDIR'/standard_ChainID_bank/ChainStandardizationRecord.txt'

./TIM/standard_MolID_bank2/2y62+00.cif:A:B:0.38132295719844356
./TIM/standard_MolID_bank2/1ag1+00.cif:A:B:0.3852140077821012
./TIM/standard_MolID_bank2/1ag1+00.cif:B:A:0.38372093023255816
./TIM/standard_MolID_bank2/1aw1+04.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+04.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+02.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+02.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+03.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+03.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+01.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+01.cif:B:B:1.0


### Final comments on this example

Note how the reference sequences are chosen. For each chain present in our dataset, the sequence chosen is the one that appears most often in the dataset.
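
The selection rule described above can be sketched with a counter per chain ID. This is an illustration of the rule as stated, not the script's code: the function name `representative_sequences`, the input layout (one `{chain_id: sequence}` dictionary per cif), and the tie-breaking behaviour (first sequence seen wins) are assumptions.

```python
from collections import Counter, defaultdict


def representative_sequences(structures):
    """Pick, for each chain ID, the sequence that occurs most often.

    structures: iterable of {chain_id: sequence} dicts, one per cif file.
    """
    by_chain = defaultdict(Counter)
    for chains in structures:
        for chain_id, seq in chains.items():
            by_chain[chain_id][seq] += 1
    # most_common(1) gives the highest count; equal counts keep first-seen order.
    return {cid: counts.most_common(1)[0][0] for cid, counts in by_chain.items()}


# Toy sequences, not taken from the TIM dataset.
toy = [
    {"A": "MKV", "B": "MKL"},
    {"A": "MKV", "B": "MKL"},
    {"A": "GSH", "B": "MKL"},
]
print(representative_sequences(toy))
```

In a dataset where one organism dominates, as with the four 1aw1 files here, this rule selects that organism's sequences as the reference.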

TIM is a homodimer whose monomers are labeled A and B. It is thus interesting that for `2y62+00.cif` we see a reassignment of the chain ID. This is a consequence of our naïve similarity score, which only counts the fraction of identical residues: because chains A and B in the reference do not have the same number of residues, their scores differ.
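
The effect of chain length on the score can be seen in a small sketch of a fraction-of-identical-positions score. The function name `naive_similarity`, the choice of the aligned length as the denominator, and the rule that gaps never count as matches are assumptions made for illustration. The two `1ag1+00` scores in the record (0.3852 and 0.3837) are what 99 identical positions would give over 257 and 258 positions respectively, which is consistent with this reading but does not confirm it.

```python
def naive_similarity(aligned_test, aligned_ref, gap="-"):
    """Fraction of aligned positions where both sequences carry the same residue.

    Both inputs must already be aligned (same length, gaps marked with `gap`).
    Assumption: the denominator is the aligned length.
    """
    if len(aligned_test) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_ref:
        return 0.0
    identical = sum(
        1 for a, b in zip(aligned_test, aligned_ref) if a == b and a != gap
    )
    return identical / len(aligned_ref)


# The same three matching residues give different scores against
# references of different lengths (toy sequences).
print(naive_similarity("ACD-", "ACDE"))
print(naive_similarity("ACD--", "ACDEF"))
```

Because the matching step compares such scores directly, a difference of one residue in reference length can be enough to change which chain ID is assigned.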

In this example, we can also see that `2y62+00` and `1ag1+00` have low scores (about 0.38). This is because they come from different organisms than the reference sequences (taken from the Vibrio marinus TIM in the 1aw1 structures, which are overrepresented in our dataset):

>2y62:Leishmania mexicana\
1ag1:Trypanosoma brucei\
1aw1:Vibrio marinus

The user can also load consensus sequences previously generated. For example, they can create a set of reference sequences from particular organisms, and then do the reassignment with a dataset of structures from different organisms. In the next notebook we will show an example of this function. 


