# Residue Standardization

## What is the goal of this notebook? 

This is the final step in creating our curated ensemble. Now that the chains have been standardized, we make sure that residue numbers are consistent across all structures. To do this, we perform a multiple sequence alignment (MSA) of all the sequences in the ensemble and then generate a new numbering system based on that MSA. Because the new numbering is derived from the alignment, it is very important to inspect the MSA generated during this step. This notebook uses a TIM dataset to show how a single-organism dataset differs from a multiple-organism dataset, and how that difference affects the MSA. There is no single correct way to do this step; choose the approach that best fits the analysis you intend to perform.
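To make the idea of MSA-based numbering concrete, here is a minimal sketch in which each residue is numbered by the alignment column it occupies. This is an illustration under that assumption, not necessarily the scheme PDBClean implements; the function name `renumber_from_alignment` and the toy sequences are hypothetical.

```python
def renumber_from_alignment(aligned):
    """Map each sequence's residues to 1-based alignment column numbers.

    `aligned` maps a sequence name to its gapped sequence ('-' marks a gap).
    Returns a dict: name -> list of (residue, new_number) pairs.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    numbering = {}
    for name, seq in aligned.items():
        # Gaps consume a column number but produce no residue entry.
        numbering[name] = [
            (residue, column)
            for column, residue in enumerate(seq, start=1)
            if residue != "-"
        ]
    return numbering


# Toy alignment (made-up sequences, not taken from the TIM dataset).
toy = {
    "struct1": "MAK-PLV",
    "struct2": "-AKGPLV",
}
for name, pairs in renumber_from_alignment(toy).items():
    print(name, pairs)
```

With this scheme, residues that share an alignment column receive the same number in every structure, which is why a poor alignment leads directly to a poor numbering.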


>**NOTE 1:** For this tutorial, we will not use the whole ensemble we downloaded. Instead, we will use a subsample of only 6 structures, chosen from the ones we downloaded in Step 0. We picked these 6 structures to highlight some issues you may run into when running this script. The next cells create the new directory.

>**NOTE 2:** This notebook is a continuation of Step 3.


In [1]:
from PDBClean import pdbclean_io

In [2]:
PROJDIR="./TIM/"

In [3]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_ResidueID_bank')

### How to run `PDBClean_ResidueStandardization_CIF.py`

In the terminal, type:

> PDBClean_ResidueStandardization_CIF.py `{Input Directory}` `{Output Directory}`

Select the following choices when prompted in the on-screen menus:  

`1) Perform multiple alignments to identify residues` -> 
`1) Show list of chains to be standardized` -> 
`4) Perform multiple alignments`

Wait for MUSCLE to finish running, then select:

`3) Perform residue number standardization` \
`Type QUIT`

For this tutorial, you can also run the following cell, which already includes the inputs for these options. 

In [4]:
! echo '1\n1\n4\n3\nQUIT\n' | PDBClean_ResidueStandardization_CIF.py $PROJDIR/standard_ChainID_bank $PROJDIR/standard_ResidueID_bank

Reading: ./TIM//standard_ChainID_bank/2y62+00.cif  (1 of 6)
Reading: ./TIM//standard_ChainID_bank/1ag1+00.cif  (2 of 6)
Reading: ./TIM//standard_ChainID_bank/1aw1+04.cif  (3 of 6)
Reading: ./TIM//standard_ChainID_bank/1aw1+02.cif  (4 of 6)
Reading: ./TIM//standard_ChainID_bank/1aw1+03.cif  (5 of 6)
Reading: ./TIM//standard_ChainID_bank/1aw1+01.cif  (6 of 6)
PDBClean Residue Number Standardization Menu
    Select one of the following options to proceed:
    1) Perform multiple alignments to identify residues
Option Number:     Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input file of chain IDs to remove from list of chains to be standardized
    4) Perform multiple alignments
Option Number: A
B
C
D
    Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Inp
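How `echo` treats `\n` depends on the shell, so the piped command above may behave differently on other systems. As an alternative, the same menu choices can be sent from Python with `subprocess.run`. This is a sketch that assumes `PDBClean_ResidueStandardization_CIF.py` is on your `PATH` and reads its menu choices from standard input, as the cell above suggests; the helper names `build_menu_input` and `run_residue_standardization` are hypothetical.

```python
import subprocess


def build_menu_input(choices):
    """Join menu choices into newline-terminated text for the script's stdin."""
    return "".join(f"{choice}\n" for choice in choices)


def run_residue_standardization(input_dir, output_dir,
                                choices=("1", "1", "4", "3", "QUIT")):
    """Run the script non-interactively (assumes it is available on PATH)."""
    return subprocess.run(
        ["PDBClean_ResidueStandardization_CIF.py", input_dir, output_dir],
        input=build_menu_input(choices),
        text=True,   # send the choices as text rather than bytes
        check=True,  # raise CalledProcessError if the script fails
    )


# Show exactly what would be sent to the script's standard input.
print(repr(build_menu_input(("1", "1", "4", "3", "QUIT"))))
```

For this tutorial, the call would look like `run_residue_standardization(PROJDIR + "standard_ChainID_bank", PROJDIR + "standard_ResidueID_bank")`.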

### Congratulations, you have a curated dataset! But... 

Step 4 generates an ensemble in which residue numbers are consistent across all structures. The script `PDBClean_ResidueStandardization_CIF.py` outputs not only the directory with the renumbered structures but also the alignments used for the renumbering, saved as FASTA files. These files are written to the directory where the script was run, which for this example is the directory where you store your notebooks. Running the cell below lists the FASTA files that were generated. We recommend checking that the renumbering matches your expectations.


In [5]:
! ls | grep fasta

A.fasta
B.fasta
C.fasta
D.fasta


These files can be visualized with software such as [Jalview](https://www.jalview.org/). Below you can see the A.fasta and B.fasta alignments which correspond to the protein chains in the TIM structures. You can refer to Step 2 for notation.
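Before opening an alignment in a viewer, a quick text-level summary can already flag problems such as a missing sequence or unexpected gaps. The sketch below uses a small hand-written FASTA parser; `read_fasta` and `summarize_alignment` are hypothetical helpers, the example sequences are made up, and the header lines are treated as opaque names because the exact header format written by PDBClean is not shown here.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequences may be wrapped over several lines.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def summarize_alignment(records):
    """Report sequence count, distinct aligned lengths, and gaps per sequence."""
    lengths = {len(seq) for seq in records.values()}
    return {
        "n_sequences": len(records),
        "alignment_lengths": sorted(lengths),
        "gaps": {name: seq.count("-") for name, seq in records.items()},
    }


# Made-up alignment used only to exercise the helpers.
example = """>seq1
MAK-PL
V
>seq2
-AKGPLV
"""
print(summarize_alignment(read_fasta(example)))
```

To apply it to a file from this step, read the file first, for example `summarize_alignment(read_fasta(open("A.fasta").read()))`. A well-formed alignment should report a single value in `alignment_lengths`.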

### A.fasta

![TIM A alignment](./images/TIMJalview0_2.png)

### B.fasta

![TIM B alignment](./images/JalviewTIMB_2.png)

Visual inspection of the alignments shows that there are two groups of sequences with high similarity. Jalview offers tools that can help cluster similar sequences. The cells below show the output of the "Calculate Tree" function, using neighbour joining and the BLOSUM62 matrix, for both the A and B alignments.

### A.fasta Tree

<div>
<img src="./images/JalviewTIMTreeA.png" width="400"/>
</div>

>**NOTE:** Even though there are 6 structures in the dataset, there are only 5 sequences in the A.fasta tree. This is because one of the structures is a monomer and only contains chain B.

### B.fasta Tree

![TIM B Tree](./images/JalviewTIMTreeB.png)
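As a rough programmatic cross-check of the grouping seen in the trees, pairwise percent identity can be computed directly from the aligned sequences. Note that this is a simpler measure than the BLOSUM62 neighbour-joining tree that Jalview computes, so it is only a sanity check. The function names and the toy sequences below are hypothetical.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def pairwise_identities(aligned):
    """Return {(name1, name2): percent identity} for every pair of sequences."""
    return {
        (n1, n2): percent_identity(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Toy aligned sequences (made up): x1 and x2 are identical, y1 differs.
toy = {"x1": "MAKPLV", "x2": "MAKPLV", "y1": "MSKPIV"}
for pair, identity in pairwise_identities(toy).items():
    print(pair, round(identity, 1))
```

Pairs with clearly higher identity than the rest are candidates for being placed in the same subset.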

These trees show that there are two major groups in the dataset. In fact, four of the sequences come from the same PDB ID (1aw1) and therefore from the same organism, Moritella marina, while the other two come from the parasites Leishmania mexicana (2y62) and Trypanosoma brucei (1ag1). Users may want to separate these two groups before reassigning residue numbers. The cells below demonstrate how to separate the two groups of structures into different directories.

In [6]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='subset4')

In [7]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='subset2')

In [8]:
pdbclean_io.check_project(projdir=PROJDIR+'subset4/', action='create', level='standard_ChainID_bank_subset')

In [9]:
pdbclean_io.check_project(projdir=PROJDIR+'subset4/', action='create', level='standard_ResidueID_bank_subset')

In [10]:
pdbclean_io.check_project(projdir=PROJDIR+'subset2/', action='create', level='standard_ChainID_bank_subset')

In [11]:
pdbclean_io.check_project(projdir=PROJDIR+'subset2/', action='create', level='standard_ResidueID_bank_subset')

In [12]:
! cp $PROJDIR/standard_ChainID_bank/1ag1+00.cif $PROJDIR/subset2/standard_ChainID_bank_subset

In [13]:
! cp $PROJDIR/standard_ChainID_bank/2y62+00.cif $PROJDIR/subset2/standard_ChainID_bank_subset

In [14]:
! cp $PROJDIR/standard_ChainID_bank/1aw1+*.cif $PROJDIR/subset4/standard_ChainID_bank_subset

In [15]:
! ls $PROJDIR/subset4/standard_ChainID_bank_subset

1aw1+01.cif 1aw1+02.cif 1aw1+03.cif 1aw1+04.cif info.txt
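For larger datasets, copying files one `cp` command at a time becomes tedious. The sketch below groups files by the PDB ID that precedes the `+` in file names such as `1aw1+01.cif`. It assumes that naming pattern holds for every file; `copy_by_pdb_id` is a hypothetical helper, and the demo runs on empty placeholder files in a temporary directory rather than on the real project.

```python
import shutil
import tempfile
from pathlib import Path


def copy_by_pdb_id(source_dir, dest_dir, pdb_ids):
    """Copy every .cif file whose PDB ID (text before '+') is in `pdb_ids`."""
    source, dest = Path(source_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for cif in sorted(source.glob("*.cif")):
        if cif.name.split("+")[0] in pdb_ids:
            shutil.copy2(cif, dest / cif.name)
            copied.append(cif.name)
    return copied


# Demo on throwaway placeholder files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    bank = Path(tmp) / "standard_ChainID_bank"
    bank.mkdir()
    for name in ("1ag1+00.cif", "2y62+00.cif", "1aw1+01.cif", "1aw1+02.cif"):
        (bank / name).touch()
    copied = copy_by_pdb_id(bank, Path(tmp) / "subset4", {"1aw1"})
    print(copied)
```

Passing `{"1ag1", "2y62"}` instead would select the files for the second group.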


In [16]:
! mv *.fasta *.fa $PROJDIR

>Running the cells below performs Step 4 separately for the two groups we created. Notice that we intentionally move the output files into the newly created directories so that they are not overwritten, since we have to run this step multiple times.

In [17]:
! echo '1\n1\n4\n3\nQUIT\n' | PDBClean_ResidueStandardization_CIF.py $PROJDIR/subset4/standard_ChainID_bank_subset $PROJDIR/subset4/standard_ResidueID_bank_subset

Reading: ./TIM//subset4/standard_ChainID_bank_subset/1aw1+04.cif  (1 of 4)
Reading: ./TIM//subset4/standard_ChainID_bank_subset/1aw1+02.cif  (2 of 4)
Reading: ./TIM//subset4/standard_ChainID_bank_subset/1aw1+03.cif  (3 of 4)
Reading: ./TIM//subset4/standard_ChainID_bank_subset/1aw1+01.cif  (4 of 4)
PDBClean Residue Number Standardization Menu
    Select one of the following options to proceed:
    1) Perform multiple alignments to identify residues
Option Number:     Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input file of chain IDs to remove from list of chains to be standardized
    4) Perform multiple alignments
Option Number: A
B
C
D
    Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input file of chain IDs to remove from list of chains to be sta

In [18]:
! mv *.fasta *.fa $PROJDIR/subset4

In [19]:
! echo '1\n1\n4\n3\nQUIT\n' | PDBClean_ResidueStandardization_CIF.py $PROJDIR/subset2/standard_ChainID_bank_subset $PROJDIR/subset2/standard_ResidueID_bank_subset

Reading: ./TIM//subset2/standard_ChainID_bank_subset/2y62+00.cif  (1 of 2)
Reading: ./TIM//subset2/standard_ChainID_bank_subset/1ag1+00.cif  (2 of 2)
PDBClean Residue Number Standardization Menu
    Select one of the following options to proceed:
    1) Perform multiple alignments to identify residues
Option Number:     Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input file of chain IDs to remove from list of chains to be standardized
    4) Perform multiple alignments
Option Number: A
B
C
D
    Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input file of chain IDs to remove from list of chains to be standardized
    4) Perform multiple alignments
Option Number: 
muscle 5.1.osx64 []  8.6Gb RAM, 4 cores
Built Feb 22 2022 02:38:35
(C) Copyright 2004-202

In [20]:
! mv *.fasta *.fa $PROJDIR/subset2
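The `mv` cells above can also be wrapped in a small Python helper, which is convenient when the step is repeated for many subsets. This is a sketch only: `move_alignment_files` is a hypothetical helper, and the file names in the demo are placeholders created in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def move_alignment_files(work_dir, dest_dir, patterns=("*.fasta", "*.fa")):
    """Move alignment outputs out of the working directory into `dest_dir`."""
    work, dest = Path(work_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in patterns:
        for path in sorted(work.glob(pattern)):
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved


# Demo with placeholder files; only the alignment outputs are moved.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    for name in ("A.fasta", "A.fa", "notes.txt"):
        (work / name).touch()
    moved = move_alignment_files(work, work / "subset2")
    remaining = sorted(p.name for p in work.iterdir() if p.is_file())
    print(moved, remaining)
```

Calling the helper right after each run keeps the alignments of one subset from being overwritten by the next.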

## Results
Now that we have rerun Step 4 for both subsets, we can take a look at the new alignments generated.

### Subset 4 Results

#### TIM A

![Subset 4 TIM A alignment](./images/JalviewSubset4TIMA_2.png)

#### TIM B

![Subset 4 TIM B alignment](./images/JalviewSubset4TIMB_2.png)

The alignments for subset 4 are "perfect". This is not surprising, as the four sequences come from the same PDB ID. 

### Subset 2 Results

#### TIM A

![Subset 2 TIM A alignment](./images/JalviewSubset2TIMA_2.png)

#### TIM B

![Subset 2 TIM B alignment](./images/JalviewSubset2TIMA_2.png)

For chain B, we see that one of the files does not contain this chain. For chain A, the alignment is very good even though the sequences are not identical. It therefore makes sense to keep the same numbering, as residues with the same number are located in the same regions of the protein. This is further supported by the structural alignment shown below. 

![TIM B Structure Alignment](./images/PymolSubset2.png)

In the end, users can decide whether to separate their structures into subsets based on the type of analysis they plan to do.

## Next Steps

For this tutorial we used a subset of the TIM structures. We recommend repeating Steps 2-4 for the whole dataset, or changing the keyword in Step 0 to try a different molecule. 

The next set of notebooks focuses on advanced curation steps and analysis. 