# Assign MolID to the entities found in the CIF files (1) 

## What is the goal of this notebook?

We will run `PDBClean_MolID_CIF.py` to re-assign the MolID to the entities found in our new ensemble of CIF files. 
The script goes over all the CIF files and collects all entities. The user can then decide what MolID to assign them. 

There are also some other benefits from running this script: 

- You can assign the same MolID to different entities, in which case those entities are concatenated. The user needs to accept each concatenation manually.
- Inspecting the list of entities will allow users to identify structures that need to be removed from the ensemble.
- You can make sure that the MolIDs of the structures in the ensemble are consistent (the same chain always has the same name, even in different structures).

This notebook will go over the cases described above. 

>**NOTE:** For this tutorial, we will not use the whole ensemble we downloaded. We will use a subsample of only 7 structures, chosen from the ones we downloaded. The next cells create the new directory and copy the structures into it. We chose these 7 structures to highlight some issues you may run into when running this script.

In [1]:
from PDBClean import pdbclean_io

In [2]:
PROJDIR="./TIM"

In [3]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='simple_bank_sub')

> Let's copy some structures from our simple_bank into the newly created 'simple_bank_sub' directory

In [4]:
!cp $PROJDIR/simple_bank/1klg+00.cif $PROJDIR/simple_bank_sub/
!cp $PROJDIR/simple_bank/2y62+00.cif $PROJDIR/simple_bank_sub/
!cp $PROJDIR/simple_bank/1ag1+00.cif $PROJDIR/simple_bank/1aw1+01.cif $PROJDIR/simple_bank_sub/
!cp $PROJDIR/simple_bank/1aw1+02.cif $PROJDIR/simple_bank/1aw1+03.cif $PROJDIR/simple_bank_sub/
!cp $PROJDIR/simple_bank/1aw1+04.cif $PROJDIR/simple_bank_sub/
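
The same subsample can also be assembled from Python, which keeps the list of chosen structures in one place and reports any file that is missing. This is only a minimal sketch using the standard library; the helper name `copy_subsample` and its arguments are illustrative assumptions and are not part of PDBClean.

```python
import shutil
from pathlib import Path

# The seven structures copied by the shell commands above.
SUBSAMPLE = [
    "1klg+00.cif", "2y62+00.cif", "1ag1+00.cif",
    "1aw1+01.cif", "1aw1+02.cif", "1aw1+03.cif", "1aw1+04.cif",
]

def copy_subsample(projdir, names=SUBSAMPLE,
                   src="simple_bank", dst="simple_bank_sub"):
    """Copy the named CIF files from src to dst (hypothetical helper).

    Returns the list of names that were not found in src.
    """
    src_dir, dst_dir = Path(projdir) / src, Path(projdir) / dst
    dst_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in names:
        source = src_dir / name
        if source.is_file():
            shutil.copy(source, dst_dir / name)
        else:
            missing.append(name)
    return missing
```

Calling `copy_subsample(PROJDIR)` should return an empty list when all seven files are present in `simple_bank`.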

In [5]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_MolID_bank')

## Running PDBClean_MolID_CIF.py 

Notice that the way to run this script in the terminal is as follows:

> PDBClean_MolID_CIF.py `{Input Directory}` `{Output Directory}`

The input directory contains the structures that we generated in Step 1. The output directory is where the new structures will be stored. 

Running this script will print a menu to screen. In the next cell we run the script and give 2 as input, so that we can select option `2) Show only unassigned conversions`. Then we `QUIT` the program. 

**Note:** We recommend running the script directly in the terminal. We are running it from the notebook just for demonstration purposes.

In [6]:
! echo '2\nQUIT' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank


Reading: ./TIM/simple_bank_sub/2y62+00.cif  (1 of 7)
Reading: ./TIM/simple_bank_sub/1ag1+00.cif  (2 of 7)
Reading: ./TIM/simple_bank_sub/1klg+00.cif  (3 of 7)
Reading: ./TIM/simple_bank_sub/1aw1+04.cif  (4 of 7)
Reading: ./TIM/simple_bank_sub/1aw1+02.cif  (5 of 7)
Reading: ./TIM/simple_bank_sub/1aw1+03.cif  (6 of 7)
Reading: ./TIM/simple_bank_sub/1aw1+01.cif  (7 of 7)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
          
Option Number: 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:SN-GLYCEROL-3-PHOSPHATE:
1:SN-GLYCEROL-1-PHOSPHATE:
1:GLYCEROL:
4:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:
1:PHOSPHATE ION:
1:HLA CLASS II HISTOCOMPATIBILITY 
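
The `echo ... |` pattern used in the cell above feeds pre-typed answers to the menu. If you prefer to stay in Python, the same idea can be sketched with `subprocess`. This assumes that `PDBClean_MolID_CIF.py` is on your `PATH` and reads its answers from standard input, as the piped call suggests; the helper names `menu_input` and `run_molid_script` are hypothetical.

```python
import subprocess

def menu_input(choices):
    """Join menu answers into newline-terminated text for the script's stdin."""
    return "".join(f"{choice}\n" for choice in choices)

def run_molid_script(input_dir, output_dir, choices,
                     script="PDBClean_MolID_CIF.py"):
    """Run the script with pre-typed answers and return what it printed.

    Assumption: the script is on PATH and reads the answers from stdin.
    """
    result = subprocess.run(
        [script, input_dir, output_dir],
        input=menu_input(choices),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

For example, `run_molid_script(PROJDIR + "/simple_bank_sub", PROJDIR + "/standard_MolID_bank", ["2", "QUIT"])` would send the same two answers as the `echo` command.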

## What does the output mean?

`1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:SN-GLYCEROL-3-PHOSPHATE:
1:SN-GLYCEROL-1-PHOSPHATE:
1:GLYCEROL:
4:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:
1:PHOSPHATE ION:
1:HLA CLASS II HISTOCOMPATIBILITY ANTIGEN, DR ALPHA CHAIN:
1:HLA CLASS II HISTOCOMPATIBILITY ANTIGEN, DR-1 BETA CHAIN:
1:TRIOSEPHOSPHATE ISOMERASE PEPTIDE:
1:ENTEROTOXIN TYPE C-3:
2:2-PHOSPHOGLYCOLIC ACID`


The output printed to the screen, and reproduced above in this cell, tells us how many MolIDs (think of them as chains) belong to each entity. For example, the first line tells us that one of the files contains an entity `TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM` with one MolID. We also see that in the case of `WATER`, there are 4 MolIDs that we need to assign. 
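
To work with this listing programmatically, the lines can be parsed into a dictionary. The sketch below assumes the `count:ENTITY NAME:` layout seen in the output above; that layout is inferred from this run rather than from a documented format, and `parse_entity_listing` is a hypothetical helper.

```python
def parse_entity_listing(text):
    """Return {entity name: number of MolIDs} from 'N:ENTITY NAME:' lines."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # The first listing line is printed right after the menu prompt.
        if line.startswith("Option Number:"):
            line = line[len("Option Number:"):].strip()
        count, sep, name = line.partition(":")
        if not sep or not count.isdigit():
            continue  # skip menu text, 'Reading:' lines, and blank lines
        counts[name.rstrip(":").strip()] = int(count)
    return counts

listing = """Option Number: 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:GLYCEROL:
4:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:"""
counts = parse_entity_listing(listing)
print(counts["WATER"])  # prints 4
```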



## Inspect the entities in your ensemble. A way to detect outliers:

Another advantage of reading this list is that we can look at all the entities present in our ensemble. In our tutorial example, we used the keyword 'triosephosphate isomerase'. If you read the list, you may find some suspicious entities, such as `HLA CLASS II HISTOCOMPATIBILITY ANTIGEN, DR ALPHA CHAIN`. On closer inspection, we can also see `TRIOSEPHOSPHATE ISOMERASE PEPTIDE`, which suggests that the structure contains only a fragment of the protein. 

Since these are suspicious entries, we can further inspect the CIF files that contain these entities. First, we need to figure out which CIF files they come from. The next cell shows one way to do it:

In [7]:
! grep "HLA CLASS II HISTOCOMPATIBILITY ANTIGEN" $PROJDIR/simple_bank_sub/*cif 
! grep "TRIOSEPHOSPHATE ISOMERASE PEPTIDE" $PROJDIR/simple_bank_sub/*cif 

./TIM/simple_bank_sub/1klg+00.cif:1 'HLA CLASS II HISTOCOMPATIBILITY ANTIGEN, DR ALPHA CHAIN'
./TIM/simple_bank_sub/1klg+00.cif:2 'HLA CLASS II HISTOCOMPATIBILITY ANTIGEN, DR-1 BETA CHAIN'
./TIM/simple_bank_sub/1klg+00.cif:3 'TRIOSEPHOSPHATE ISOMERASE PEPTIDE'


All of these entities come from a single CIF file: `1klg+00.cif`.
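
If `grep` is not available (for example on Windows), a similar search can be sketched in Python. The helper `find_files_with_text` is an illustrative assumption, not a PDBClean function; it does a plain substring search over each file.

```python
from pathlib import Path

def find_files_with_text(directory, needle, pattern="*.cif"):
    """Return sorted names of files in directory whose text contains needle."""
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        # errors="replace" keeps the scan going if a file has unusual bytes.
        if needle in path.read_text(errors="replace"):
            hits.append(path.name)
    return hits
```

For instance, `find_files_with_text(PROJDIR + "/simple_bank_sub", "TRIOSEPHOSPHATE ISOMERASE PEPTIDE")` lists the files that mention that entity.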

By reading the CIF file (run the cell below after removing the '#') or using a molecular visualization tool, the user can see that this structure is an outlier. It was selected because it contains a small fragment of triosephosphate isomerase, but the main structure is the HLA Class II histocompatibility antigen. It is best to remove such structures from our ensemble. 

In [8]:
# ! cat $PROJDIR/simple_bank_sub/1klg+00.cif

In [9]:
# Remove problematic CIF file

! rm $PROJDIR/simple_bank_sub/1klg+00.cif 


## How to assign new MolID? 

Let's rerun `PDBClean_MolID_CIF.py` with our subsampled ensemble, now with only 6 structures. 

In [10]:
! echo '2\nQUIT' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

Reading: ./TIM/simple_bank_sub/2y62+00.cif  (1 of 6)
Reading: ./TIM/simple_bank_sub/1ag1+00.cif  (2 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+04.cif  (3 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+02.cif  (4 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+03.cif  (5 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
          
Option Number: 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:SN-GLYCEROL-3-PHOSPHATE:
1:SN-GLYCEROL-1-PHOSPHATE:
1:GLYCEROL:
2:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:
1:PHOSPHATE ION:
2:2-PHOSPHOGLYCOLIC ACID:
PDBClean MolID Conversion Build Menu
             Select one 

### Renaming MolID, how to choose a name? 

This is a personal decision. You can decide how to name each entity. For example, the easiest way is to assign a different MolID to each entity, as shown in the table below:

| New MolID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| B | 1:SN-GLYCEROL-3-PHOSPHATE: |
| C | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D | 1:GLYCEROL: |
| E,F | 2:WATER: |
| G,H | 2:TRIOSEPHOSPHATE ISOMERASE: |
| I | 1:PHOSPHATE ION: |
| J,K | 2:2-PHOSPHOGLYCOLIC ACID: | 


We need to input the new assignment manually when it is printed on screen. Notice that in the next cell, `echo` allows us to type the input in advance. 

`2) Show only unassigned conversions` -> `5) Go entry by entry to add chain ID conversion` -> `Letters we chose on the table in this cell` -> `7) Continue to next step of curation` -> `6) Finalize Curation`
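
For larger ensembles, typing the letters by hand gets tedious. The sketch below generates the "one new MolID per chain" assignment from the table and builds the text to pipe into the script. It assumes the script asks for the entities in the order they are listed, as it did in this run; `assign_molids` is a hypothetical helper and only handles up to 26 MolIDs.

```python
from string import ascii_uppercase

def assign_molids(counts):
    """Give each entity as many consecutive letters as it has MolIDs.

    counts: list of (entity name, number of MolIDs) in the order listed.
    Returns a list of comma-joined letter groups, e.g. ['A', 'B,C'].
    """
    total = sum(n for _, n in counts)
    if total > len(ascii_uppercase):
        raise ValueError("this sketch only handles up to 26 MolIDs")
    letters = iter(ascii_uppercase)
    return [",".join(next(letters) for _ in range(n)) for _, n in counts]

entities = [
    ("TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM", 1),
    ("SN-GLYCEROL-3-PHOSPHATE", 1),
    ("SN-GLYCEROL-1-PHOSPHATE", 1),
    ("GLYCEROL", 1),
    ("WATER", 2),
    ("TRIOSEPHOSPHATE ISOMERASE", 2),
    ("PHOSPHATE ION", 1),
    ("2-PHOSPHOGLYCOLIC ACID", 2),
]
groups = assign_molids(entities)
# Menu choices 2 and 5, the letter groups, then 7 and 6, as in the sequence above.
stdin_text = "\n".join(["2", "5", *groups, "7", "6"]) + "\n"
```

Here `groups` reproduces the "New MolID" column of the table, and `stdin_text` holds the same answers that the `echo` command in the next cell types in advance.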


In [11]:
! echo '2\n5\nA\nB\nC\nD\nE,F\nG,H\nI\nJ,K\n7\n6\n' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

Reading: ./TIM/simple_bank_sub/2y62+00.cif  (1 of 6)
Reading: ./TIM/simple_bank_sub/1ag1+00.cif  (2 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+04.cif  (3 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+02.cif  (4 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+03.cif  (5 of 6)
Reading: ./TIM/simple_bank_sub/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
          
Option Number: 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:SN-GLYCEROL-3-PHOSPHATE:
1:SN-GLYCEROL-1-PHOSPHATE:
1:GLYCEROL:
2:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:
1:PHOSPHATE ION:
2:2-PHOSPHOGLYCOLIC ACID:
PDBClean MolID Conversion Build Menu
             Select one 