# Chain ID standardization

## What is the goal of this notebook?

This is a continuation of the notebook for Step 3.1. 

Sometimes entities are mislabeled, and we want chains with the same sequence in different CIF files to be named the same way. To achieve this, we run a pairwise sequence alignment of every sequence against a reference sequence. In this notebook, we show how to generate the reference sequences and how to load them when analyzing new structures.

>**NOTE:** This is just an example, but we recommend using this tool when dealing with bigger datasets (not just 6 structures as in our dataset). This notebook intends to show you some of the issues you may run into when using this approach. 


## First, import the library and set up directories

In [1]:
from PDBClean import pdbclean_io

In [2]:
PROJDIR="./TIM"

In [3]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_ChainID_bank_2')

### What do you need to generate reference sequences? 

We need a set of structures. We can use all the structures curated so far, or only those coming from a particular organism (e.g. E. coli). For this tutorial we will use a subset of the structures analyzed until now. The next two cells create a new directory and copy only two of the CIFs from our dataset into it.

In [4]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_MolID_bank3')

In [5]:
!cp $PROJDIR/standard_MolID_bank2/2y62+00.cif  $PROJDIR/standard_MolID_bank3/
!cp $PROJDIR/standard_MolID_bank2/1ag1+00.cif  $PROJDIR/standard_MolID_bank3/

### How to generate a reference sequence using `PDBClean_ChainStandardization_CIF.py`

We can use `PDBClean_ChainStandardization_CIF.py` to generate the reference sequences. 

In the terminal, type:

> PDBClean_ChainStandardization_CIF.py `{Input Directory}` `{Output Directory}`

Select the following choices when prompted in the on-screen menus:

>`2) Generate Standard Sequences based on all the input structures` -> 
`4) Generate Standard Sequences based on all input structures` -> `QUIT`

The chosen reference sequences will be printed to screen.

In addition, all the sequences identified in our dataset will be printed, together with how prevalent each sequence is in our set of structures (see the number after "Number of structures:"). The "reference" will be the longest sequence that is repeated the greatest number of times.

We suggest taking some time to inspect these results. They can help you identify any outlier that has slipped through the previous curation steps, and they also show how diverse your dataset is.
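The selection rule described above can be illustrated with a minimal sketch. It assumes one reading of the rule (pick the most frequent sequence, and break ties by length); the function name `pick_reference` and the toy sequences are hypothetical, and this is not PDBClean's actual implementation.

```python
from collections import Counter

def pick_reference(sequences):
    """Return the most frequent sequence; ties go to the longest one.

    This is an assumed interpretation of the rule, not PDBClean's code.
    """
    counts = Counter(sequences)
    # Sort key: (number of occurrences, sequence length).
    return max(counts, key=lambda seq: (counts[seq], len(seq)))

# Toy sequences, not taken from the tutorial dataset.
chain_a_sequences = ["MKTAYIAK", "MKTAYIAK", "MKTAYIAKQR", "MKT"]
print(pick_reference(chain_a_sequences))
```

Comparing the counts by hand against the "Number of structures:" values printed by the script is a quick way to check whether the chosen reference is the one you expect.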




In [6]:
! echo '2\n4\nQUIT\n' | PDBClean_ChainStandardization_CIF.py $PROJDIR/standard_MolID_bank3 $PROJDIR/standard_ChainID_bank_2

Reading: ./TIM/standard_MolID_bank3/2y62+00.cif  (1 of 2)
Reading: ./TIM/standard_MolID_bank3/1ag1+00.cif  (2 of 2)
PDBClean ChainID Standardization Menu
    Select one of the following options to proceed:
    1) Select Standard Sequences from a chosen input structure
    2) Generate Standard Sequences based on all the input structures
Option Number:     Generate Standard Sequences based on all the input structures.
    Type QUIT to return to the main menu.
    1) Show list of chain IDs for Standard Sequences
    2) Enter chain IDs to remove from list
    3) Input file with list of chain IDs to remove
    4) Generate Standard Sequences based on all input structures
    5) Load previously generated standard sequences
Option Number: A
this_chansseq_set
{'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASL

### How to save reference sequence:

The following text appears at the bottom of the output of the cell above:

>These are the standard sequences:\
{'A': 'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSL\
PILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAY\
EPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ', \
'B': 'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDF\
GVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGT\
GKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ'}

The user needs to copy this dictionary manually into a new text file. The cell below does this step for this tutorial: the text after `echo` is the dictionary with the reference sequences. At this point the user can also decide to use a different sequence for any chain.

>**Note:** We need to modify the dictionary so that single quotes become double quotes; the following cells show how to do this. This is important, as otherwise the dictionary won't be loaded properly in the next step.


In [6]:
! echo "{'A': 'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ', 'B': 'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ'}" > ./TIM/refseqs.txt

In [7]:
!cat $PROJDIR/refseqs.txt

{'A': 'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ', 'B': 'SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ'}


In [8]:
! sed $'s/\'/"/g' $PROJDIR/refseqs.txt > tmp
! mv tmp $PROJDIR/refseqs.txt

In [9]:
!cat $PROJDIR/refseqs.txt

{"A": "SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ", "B": "SKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ"}


Now we have a file that contains the reference sequences (refseqs.txt). The next step will show the user how to load this file. 
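As an alternative to the `sed` step above, the quote conversion can be sketched in Python. This is a minimal example that assumes the printed dictionary is a valid Python literal; the helper name `to_json_text` and the shortened sequences are hypothetical.

```python
import ast
import json

def to_json_text(printed_dict):
    """Convert a Python-style dict literal (single quotes) to double-quoted text."""
    sequences = ast.literal_eval(printed_dict)  # safely parse the literal
    return json.dumps(sequences)

# Shortened toy sequences, used only for illustration.
printed = "{'A': 'SKPQ', 'B': 'SKPQ'}"
json_text = to_json_text(printed)
print(json_text)
```

The resulting string can then be written to `refseqs.txt`, which avoids a temporary file and also catches a malformed dictionary early, since `ast.literal_eval` raises an error on invalid input.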

### Load reference sequences

Now that we have created a text file with the reference sequences, it is time to load them before doing the sequence alignments. 

In the terminal, type:

> PDBClean_ChainStandardization_CIF.py `{Input Directory}` `{Output Directory}`

Select the following choices when prompted in the on-screen menus:

>`2) Generate Standard Sequences based on all the input structures` -> 
`5) Load previously generated standard sequences` -> `./TIM/refseqs.txt` -> 
`4) Perform Standardization of Chain IDs` -> `4) Perform pairwise alignments against Standard Sequences and rename chains (if needed)`

**Notice** that even though we created the reference sequences from only 2 structures, we are going to use them for our dataset of 6 structures. In practice we suggest the reverse: create the reference once from a big dataset, and then run this script on smaller datasets. This way you can run the script in parallel and speed up this step. It is also useful if you have already curated a dataset and want to include newly released structures.
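The menu choices listed above can also be assembled in Python before being piped to the script. The sketch below only builds the input text; the helper name `menu_input` is hypothetical.

```python
def menu_input(choices):
    """Join menu answers into newline-terminated text for the script's stdin."""
    return "".join(f"{choice}\n" for choice in choices)

# Same sequence of answers as in the menu walkthrough above.
answers = menu_input(["2", "5", "./TIM/refseqs.txt", "4", "4"])
print(repr(answers))
```

A string like this could be passed to the script through `subprocess.run(..., input=answers, text=True)`, which may be more convenient than `echo` when the same choices are reused for several input directories.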


In [11]:
! echo '2\n5\n./TIM/refseqs.txt\n4\n4\n' | PDBClean_ChainStandardization_CIF.py $PROJDIR/standard_MolID_bank2 $PROJDIR/standard_ChainID_bank_2

Reading: ./TIM/standard_MolID_bank2/2y62+00.cif  (1 of 6)
Reading: ./TIM/standard_MolID_bank2/1ag1+00.cif  (2 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+04.cif  (3 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+02.cif  (4 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+03.cif  (5 of 6)
Reading: ./TIM/standard_MolID_bank2/1aw1+01.cif  (6 of 6)
PDBClean ChainID Standardization Menu
    Select one of the following options to proceed:
    1) Select Standard Sequences from a chosen input structure
    2) Generate Standard Sequences based on all the input structures
Option Number:     Generate Standard Sequences based on all the input structures.
    Type QUIT to return to the main menu.
    1) Show list of chain IDs for Standard Sequences
    2) Enter chain IDs to remove from list
    3) Input file with list of chain IDs to remove
    4) Generate Standard Sequences based on all input structures
    5) Load previously generated standard sequences
Option Number: File with dictionary of the

00:00 7.2Mb   100.0% UPGMA5         

muscle 5.1.osx64 []  8.6Gb RAM, 4 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 252, max 254

00:00 4.1Mb  CPU has 4 cores, running 4 threads
00:00 4.3Mb   100.0% Calc posteriors
00:00 7.1Mb   100.0% UPGMA5         
Chains already completed:
['A', 'B']
Just finished with this structure:
['./TIM/standard_MolID_bank2/1aw1+04.cif']
These are the results:
{'A': 'B', 'B': 'A'}
{'A': '0.38372093023255816', 'B': '0.3852140077821012'}
structid_list
['./TIM/standard_MolID_bank2/1aw1+04.cif']
my file name
./TIM/standard_MolID_bank2/1aw1+04.cif
Original Chain ID | New Chain ID | Naive Similarity Score 
./TIM/standard_MolID_bank2/1aw1+04.cif
A B 0.38372093023255816
B A 0.3852140077821012
new_pdb_out
./TIM/standard_ChainID_bank_2/1aw1+04.cif
I am starting to work on:
['./TIM/standard_MolID_bank2/1aw1+02.cif']
this is chid_seq_map and length
{'A': 'RHPVVMGNWKLNGSKEMVVDLLNGLNAELEGVTGVDVAV


muscle 5.1.osx64 []  8.6Gb RAM, 4 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 252, max 255

00:00 4.1Mb  CPU has 4 cores, running 4 threads
00:00 4.3Mb   100.0% Calc posteriors
00:00 7.1Mb   100.0% UPGMA5         

muscle 5.1.osx64 []  8.6Gb RAM, 4 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 252, max 254

00:00 4.1Mb  CPU has 4 cores, running 4 threads
00:00 4.2Mb   100.0% Calc posteriors
00:00 7.1Mb   100.0% UPGMA5         
Chains already completed:
['A']
Now I am working on this chain:
B

muscle 5.1.osx64 []  8.6Gb RAM, 4 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com



Input: 2 seqs, avg length 252, max 254

00:00 4.2Mb  CPU has 4 cores, running 4 threads
00:00 4.3Mb   100.0% Calc posteriors
00:00 7.1Mb   100.0% UPGMA5         

muscle 5.1.osx64 []  8.6Gb RAM, 4 cores
Built Feb 22 2

### Visualize results

In [12]:
! cat $PROJDIR'/standard_ChainID_bank_2/ChainStandardizationRecord.txt'

./TIM/standard_MolID_bank2/2y62+00.cif:A:A:0.7028112449799196
./TIM/standard_MolID_bank2/1ag1+00.cif:A:A:1.0
./TIM/standard_MolID_bank2/1ag1+00.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+04.cif:A:B:0.38372093023255816
./TIM/standard_MolID_bank2/1aw1+04.cif:B:A:0.3852140077821012
./TIM/standard_MolID_bank2/1aw1+02.cif:A:B:0.38372093023255816
./TIM/standard_MolID_bank2/1aw1+02.cif:B:A:0.3852140077821012
./TIM/standard_MolID_bank2/1aw1+03.cif:A:B:0.38372093023255816
./TIM/standard_MolID_bank2/1aw1+03.cif:B:A:0.3852140077821012
./TIM/standard_MolID_bank2/1aw1+01.cif:A:B:0.38372093023255816
./TIM/standard_MolID_bank2/1aw1+01.cif:B:A:0.3852140077821012
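Each line of the record has the form `file:original chain ID:new chain ID:score`, so it is easy to post-process. The following is a minimal sketch; the function name `parse_record` is hypothetical, and the two sample lines are copied from the output above.

```python
def parse_record(text):
    """Split each 'path:old:new:score' line into a (path, old, new, score) tuple."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # rsplit from the right keeps any ':' inside the path intact.
        path, old_id, new_id, score = line.strip().rsplit(":", 3)
        rows.append((path, old_id, new_id, float(score)))
    return rows

record = """./TIM/standard_MolID_bank2/1ag1+00.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+04.cif:A:B:0.38372093023255816"""

for path, old_id, new_id, score in parse_record(record):
    status = "renamed" if old_id != new_id else "kept"
    print(f"{path} {old_id}->{new_id} {score:.2f} {status}")
```

Listing the renamed chains together with their scores makes it easier to spot assignments that rest on a low similarity score.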


### Final comments on this example

We can compare with the results from our previous notebook:
    

In [13]:
! cat $PROJDIR'/standard_ChainID_bank/ChainStandardizationRecord.txt'

./TIM/standard_MolID_bank2/2y62+00.cif:A:B:0.38132295719844356
./TIM/standard_MolID_bank2/1ag1+00.cif:A:B:0.3852140077821012
./TIM/standard_MolID_bank2/1ag1+00.cif:B:A:0.38372093023255816
./TIM/standard_MolID_bank2/1aw1+04.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+04.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+02.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+02.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+03.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+03.cif:B:B:1.0
./TIM/standard_MolID_bank2/1aw1+01.cif:A:A:1.0
./TIM/standard_MolID_bank2/1aw1+01.cif:B:B:1.0


We can see that the results from our previous notebook are better: there, most chains keep their original ID with a score of 1.0, because the reference sequences were the most representative of the dataset we intended to analyze. Our example is also an extreme case, as TIM is a homodimer, so either chain could be named A or B, and the assignment will depend on small differences among chains (maybe some residues missing in one of the chains). We want to highlight the importance of knowing your system, the dataset you are using, and what your motivation is when curating these structures.

Also, as mentioned previously, we suggest using this tool to get the reference sequences from the complete dataset, and then running the standardization in batches, in parallel, to speed it up.
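Splitting a dataset into batches can be sketched as follows. This is only an illustration: the helper name `make_batches` and the batch size are hypothetical, and the sketch does not run the standardization itself.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# File names from the tutorial dataset.
cif_files = ["2y62+00.cif", "1ag1+00.cif", "1aw1+01.cif",
             "1aw1+02.cif", "1aw1+03.cif", "1aw1+04.cif"]

for number, batch in enumerate(make_batches(cif_files, 4), start=1):
    print(number, batch)
```

Each batch could then be copied into its own input directory and processed with the same reference file, one process per batch.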

In some cases, you may also want to exclude certain chains from the analysis. For example, if only one or two structures have a particular chain, you may want to remove that chain, as this will speed up the process significantly. You only need to provide a text file that lists, one per line, the names of the chains you want to exclude. Provide the file name when the following option is prompted:

>`3) Input file with list of chain IDs to remove`
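A file in this one-chain-per-line layout can be written with a few lines of Python. The sketch below is hypothetical throughout: the chain IDs, the file name, and the helper `write_chain_list` are chosen only for illustration.

```python
from pathlib import Path
import tempfile

def write_chain_list(chain_ids, path):
    """Write one chain ID per line to `path` and return it as a Path."""
    path = Path(path)
    path.write_text("".join(f"{chain_id}\n" for chain_id in chain_ids))
    return path

# Hypothetical chain IDs and file name; a temporary directory keeps the demo tidy.
out_dir = Path(tempfile.mkdtemp())
chain_file = write_chain_list(["C", "D"], out_dir / "chains_to_ignore.txt")
print(chain_file.read_text())
```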

