# Residue Standardization

## What is the goal of this notebook? 

This is the final step in creating our curated ensemble. Now that the chains have been standardized, we make sure that the residue numbers are consistent across all structures. We do this by performing a multiple sequence alignment (MSA) of all the sequences in our ensemble, and then generating a new numbering system based on the MSA. 
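
To make the idea of MSA-based numbering concrete, here is a minimal sketch. It is not the code PDBClean runs; it only illustrates one common convention, in which each residue takes the index of the alignment column it occupies. The function name `msa_numbering` and the toy sequences are hypothetical and are not taken from the TIM ensemble.

```python
def msa_numbering(aligned_seq, gap="-"):
    """Return the 1-based alignment column of every non-gap residue."""
    return [col for col, char in enumerate(aligned_seq, start=1) if char != gap]


# Toy alignment (hypothetical sequences, all padded to the same length by the MSA).
alignment = {
    "structA": "MK-LVAG",
    "structB": "MKTLV-G",
    "structC": "-KTLVAG",
}

# Each structure gets new residue numbers taken from the shared alignment columns,
# so equivalent residues carry the same number in every structure.
new_numbers = {name: msa_numbering(seq) for name, seq in alignment.items()}

for name, numbers in new_numbers.items():
    print(name, numbers)
```

In this toy case the leucine sits in column 4 of all three sequences, so it is numbered 4 everywhere, even though it is the third residue in `structA` and `structC` and the fourth in `structB`.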

>**NOTE 1:** For this tutorial, we will not use the whole ensemble we downloaded. We will use a subsample of only 6 structures. The next cells will create the new directory. Notice that we are choosing these 6 structures from the ones we downloaded in Step 0. We chose them to highlight some issues you may run into when running this script. 

>**NOTE 2:** This notebook is the continuation of Step 3.

In [1]:
from PDBClean import pdbclean_io

In [2]:
PROJDIR="./TIM"

In [3]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_ResidueID_bank')

### How to run `PDBClean_ResidueStandardization_CIF.py`

In the terminal, type:

> PDBClean_ResidueStandardization_CIF.py `{Input Directory}` `{Output Directory}`

Select the following options when prompted by the on-screen menus:  

`1) Perform multiple alignments to identify residues` -> 
`1) Show list of chains to be standardized` -> 
`4) Perform multiple alignments`

Wait for MUSCLE to finish running, then continue with:

`3) Perform residue number standardization` \
`Type QUIT`

For this tutorial, you can also run the following cell, which already includes the inputs for these options. 

In [10]:
! printf '1\n1\n4\n3\nQUIT\n' | PDBClean_ResidueStandardization_CIF.py $PROJDIR/standard_ChainID_bank $PROJDIR/standard_ResidueID_bank

Reading: ./TIM/standard_ChainID_bank/2y62+00.cif  (1 of 6)
Reading: ./TIM/standard_ChainID_bank/1ag1+00.cif  (2 of 6)
Reading: ./TIM/standard_ChainID_bank/1aw1+04.cif  (3 of 6)
Reading: ./TIM/standard_ChainID_bank/1aw1+02.cif  (4 of 6)
Reading: ./TIM/standard_ChainID_bank/1aw1+03.cif  (5 of 6)
Reading: ./TIM/standard_ChainID_bank/1aw1+01.cif  (6 of 6)
PDBClean Residue Number Standardization Menu
    Select one of the following options to proceed:
    1) Perform multiple alignments to identify residues
Option Number:     Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input file of chain IDs to remove from list of chains to be standardized
    4) Perform multiple alignments
Option Number: A
B
C
D
    Perform multiple alignments to identify residues
    1) Show list of chains to be standardized
    2) Remove chain IDs from list of chains to be standardized
    3) Input fil
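
The piped menu answers used in the cell above can also be assembled in Python, which avoids relying on how a given shell handles `\n` escapes. The following is only a sketch: the helper names `build_menu_input` and `run_residue_standardization` are hypothetical, and it assumes `PDBClean_ResidueStandardization_CIF.py` is executable and on your `PATH`, as it is when invoked from the terminal.

```python
import subprocess


def build_menu_input(choices):
    """Join menu answers into the text the script reads from standard input."""
    return "".join(f"{choice}\n" for choice in choices)


def run_residue_standardization(input_dir, output_dir, choices):
    """Run the script non-interactively (assumes it is executable and on PATH)."""
    return subprocess.run(
        ["PDBClean_ResidueStandardization_CIF.py", input_dir, output_dir],
        input=build_menu_input(choices),  # answers are fed to the on-screen menus
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )


# Same sequence of answers as in the tutorial cell.
MENU_CHOICES = ["1", "1", "4", "3", "QUIT"]
menu_input = build_menu_input(MENU_CHOICES)

# Example call (commented out so the sketch can be read without side effects):
# run_residue_standardization(
#     "./TIM/standard_ChainID_bank", "./TIM/standard_ResidueID_bank", MENU_CHOICES
# )
```

Keeping the answers in a list makes it easy to see which menu option each entry corresponds to, and to change a single step without editing an escaped string.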

### Congratulations! You have created your first curated ensemble!

For this tutorial, we used a subset of all the TIM structures. We recommend repeating steps 2-4 for the whole dataset, 
or changing the keyword in Step 0 to try a different molecule. 

The next set of notebooks focuses on advanced curation steps and analysis. 