# Assign MolID to the entities found in the CIF files (2) 

## What is the goal of this notebook?

This is a continuation from `Assign MolID to the entities found in the CIF files (1)`.
In this notebook we show what happens when you assign the same MolID to different chains in order to concatenate them, for example to place all the waters or all the ions in the same chain.

**Note:** Make sure to run part 1 of this step in advance.

In [1]:
## First, import the library and set up directories

In [2]:
from PDBClean import pdbclean_io

In [3]:
PROJDIR="./TIM"

In [4]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='simple_bank_sub2')

In [5]:
# Let's copy the same structures we selected in step 2.1

! cp $PROJDIR/simple_bank_sub/*cif $PROJDIR/simple_bank_sub2/

In [6]:
pdbclean_io.check_project(projdir=PROJDIR, action='create', level='standard_MolID_bank2')

### Running PDBClean_MolID_CIF.py

Remember that the way to run this script in the terminal is as follows:

> PDBClean_MolID_CIF.py `{Input Directory}` `{Output Directory}`

The input directory contains the structures that we generated in Step 1. The output directory is where the new structures will be stored. 
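
If you prefer to launch the script from Python instead of a `!` shell cell, the call can be sketched with the standard-library `subprocess` module. This is a minimal sketch, assuming `PDBClean_MolID_CIF.py` is on your `PATH`. The helper names `build_command` and `run_molid_cif` are hypothetical and are not part of PDBClean.

```python
import subprocess


def build_command(input_dir, output_dir):
    """Return the argument list for PDBClean_MolID_CIF.py (assumed to be on PATH)."""
    return ["PDBClean_MolID_CIF.py", str(input_dir), str(output_dir)]


def run_molid_cif(input_dir, output_dir, answers=None):
    """Run the script. `answers` is optional text piped to its interactive menu.

    Hypothetical helper: with answers=None the script reads from the
    terminal as usual.
    """
    return subprocess.run(
        build_command(input_dir, output_dir),
        input=answers,
        text=True,
        check=False,
    )


PROJDIR = "./TIM"
cmd = build_command(f"{PROJDIR}/simple_bank_sub2", f"{PROJDIR}/standard_MolID_bank2")
print(" ".join(cmd))
```

Nothing is executed until `run_molid_cif` is called, so the block is safe to run before the directories exist.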

### Renaming MolID, how to choose a name? 

This is a personal decision. You can decide how to name each entity. In part 2.1 we assigned a different MolID to each entity, as shown in the table below:

| New MolID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| B | 1:SN-GLYCEROL-3-PHOSPHATE: |
| C | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D | 1:GLYCEROL: |
| E,F | 2:WATER: |
| G,H | 2:TRIOSEPHOSPHATE ISOMERASE: |
| I | 1:PHOSPHATE ION: |
| J,K | 2:2-PHOSPHOGLYCOLIC ACID: | 


For this example, let's try assigning the same MolID to different entities: 

| New MolID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| D | 1:SN-GLYCEROL-3-PHOSPHATE: |
| D | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D | 1:GLYCEROL: |
| C,C | 2:WATER: |
| A,B | 2:TRIOSEPHOSPHATE ISOMERASE: |
| D | 1:PHOSPHATE ION: |
| D,D | 2:2-PHOSPHOGLYCOLIC ACID: | 
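
To see at a glance which MolIDs the table above reuses, you could tally them in Python. This is only an illustrative sketch. The dictionary `assignment` is typed in by hand from the table and is not produced by PDBClean. A reused letter does not necessarily lead to a concatenation prompt: judging from the script output later in this notebook, concatenations appear to be reported per file, so it matters which entities occur together in the same CIF file.

```python
from collections import Counter

# Proposed assignment copied from the table above:
# entity name -> list of MolIDs, one per copy of that entity.
assignment = {
    "TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM": ["A"],
    "SN-GLYCEROL-3-PHOSPHATE": ["D"],
    "SN-GLYCEROL-1-PHOSPHATE": ["D"],
    "GLYCEROL": ["D"],
    "WATER": ["C", "C"],
    "TRIOSEPHOSPHATE ISOMERASE": ["A", "B"],
    "PHOSPHATE ION": ["D"],
    "2-PHOSPHOGLYCOLIC ACID": ["D", "D"],
}

# Count how many chains receive each MolID across all entities.
counts = Counter(molid for molids in assignment.values() for molid in molids)

# MolIDs given to more than one chain are candidates for concatenation.
reused = sorted(molid for molid, n in counts.items() if n > 1)
print(reused)  # ['A', 'C', 'D']
```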



In [7]:
! echo '2\n5\nA\nD\nD\nD\nC,C\nA,B\nD\nD,D\n7\n2\nQUIT\n' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub2 $PROJDIR/standard_MolID_bank2

Reading: ./TIM/simple_bank_sub2/2y62+00.cif  (1 of 6)
Reading: ./TIM/simple_bank_sub2/1ag1+00.cif  (2 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+04.cif  (3 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+02.cif  (4 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+03.cif  (5 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
          
Option Number: 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:SN-GLYCEROL-3-PHOSPHATE:
1:SN-GLYCEROL-1-PHOSPHATE:
1:GLYCEROL:
2:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:
1:PHOSPHATE ION:
2:2-PHOSPHOGLYCOLIC ACID:
PDBClean MolID Conversion Build Menu
             Selec

## A pause to explain what is going on:

Notice that a new menu appears when we assign the same MolID to more than one entity. We need to either give a new MolID to the entities, or accept a concatenation. We want to guarantee that you did not assign the same MolID by mistake, so you need to approve each case one by one. 

In the cell above, we chose option `2) Show only unaccepted concatenations`. Let's take a look at the output:

```
./TIM/simple_bank_sub2/2y62+00.cif:SN-GLYCEROL-3-PHOSPHATE:hetG3PA:D:1
./TIM/simple_bank_sub2/2y62+00.cif:SN-GLYCEROL-1-PHOSPHATE:het1GPA:D:2
./TIM/simple_bank_sub2/2y62+00.cif:GLYCEROL:hetGOLA:D:3
./TIM/simple_bank_sub2/1ag1+00.cif:WATER:hetHOHO:C:1
./TIM/simple_bank_sub2/1ag1+00.cif:WATER:hetHOHT:C:2
./TIM/simple_bank_sub2/1aw1+04.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAJ:D:1
./TIM/simple_bank_sub2/1aw1+04.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAK:D:2
./TIM/simple_bank_sub2/1aw1+04.cif:WATER:hetHOHJ:C:1
./TIM/simple_bank_sub2/1aw1+04.cif:WATER:hetHOHK:C:2
./TIM/simple_bank_sub2/1aw1+02.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAD:D:1
./TIM/simple_bank_sub2/1aw1+02.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAE:D:2
./TIM/simple_bank_sub2/1aw1+02.cif:WATER:hetHOHD:C:1
./TIM/simple_bank_sub2/1aw1+02.cif:WATER:hetHOHE:C:2
./TIM/simple_bank_sub2/1aw1+03.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAG:D:1
./TIM/simple_bank_sub2/1aw1+03.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAH:D:2
./TIM/simple_bank_sub2/1aw1+03.cif:WATER:hetHOHG:C:1
./TIM/simple_bank_sub2/1aw1+03.cif:WATER:hetHOHH:C:2
./TIM/simple_bank_sub2/1aw1+01.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAA:D:1
./TIM/simple_bank_sub2/1aw1+01.cif:2-PHOSPHOGLYCOLIC ACID:hetPGAB:D:2
./TIM/simple_bank_sub2/1aw1+01.cif:WATER:hetHOHA:C:1
./TIM/simple_bank_sub2/1aw1+01.cif:WATER:hetHOHB:C:2
```

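Notice that the format is:

`file name` : `entity` : `original chain label` : `New MolID we just assigned` : `order of entity with same MolID in CIF file`

The original chain label (for example `hetG3PA`) appears to combine the prefix `het`, the residue name, and the original chain ID.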
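
Each line of the output above has five colon-separated fields, so it can be split into a small record and grouped by file and new MolID. The sketch below is a hedged illustration, not part of PDBClean. The function name `parse_concatenation_line` and the field name `label` are hypothetical, and the code assumes that file paths contain no colon.

```python
from collections import defaultdict


def parse_concatenation_line(line):
    """Split one 'file:entity:label:MolID:order' line into a dict."""
    # The file name is everything before the first colon (assumes no colon in paths).
    file_name, rest = line.strip().split(":", 1)
    # Split the remainder from the right so the entity name is kept whole.
    entity, label, molid, order = rest.rsplit(":", 3)
    return {
        "file": file_name,
        "entity": entity,
        "label": label,
        "molid": molid,
        "order": int(order),
    }


# A few lines copied from the script output above.
sample = [
    "./TIM/simple_bank_sub2/2y62+00.cif:SN-GLYCEROL-3-PHOSPHATE:hetG3PA:D:1",
    "./TIM/simple_bank_sub2/2y62+00.cif:SN-GLYCEROL-1-PHOSPHATE:het1GPA:D:2",
    "./TIM/simple_bank_sub2/2y62+00.cif:GLYCEROL:hetGOLA:D:3",
    "./TIM/simple_bank_sub2/1ag1+00.cif:WATER:hetHOHO:C:1",
    "./TIM/simple_bank_sub2/1ag1+00.cif:WATER:hetHOHT:C:2",
]

# Group the entities that would end up in the same chain of the same file.
groups = defaultdict(list)
for record in map(parse_concatenation_line, sample):
    groups[(record["file"], record["molid"])].append(record["entity"])

for (file_name, molid), entities in groups.items():
    print(file_name, molid, entities)
```

Grouping this way shows, per file, which entities would be merged into one chain. This is the information to check before accepting a concatenation.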

To continue running the script, you need to accept each of these concatenations. In this notebook we only show how to accept three of them. We recommend doing this step in the terminal and approving each concatenation one by one. Choosing menu option `6) Accept proposed concatenation one by one` prints one of the concatenations that still needs to be approved. A new menu then appears, in which we choose option `2) Accept planned concatenation`. This brings us back to the concatenation menu. We repeat this step (option 6, then option 2) until every concatenation has been accepted.

Once all concatenations have been accepted, an option to finalize the curation will appear.
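
The answers piped to the script in the cells of this notebook follow a fixed pattern: two menu choices, one MolID per entity, two more menu choices, then one `6`/`2` pair per accepted concatenation, and finally `QUIT`. As a sketch, that string could be assembled in Python as shown below. The function name `build_menu_answers` is hypothetical. The option numbers are copied from the notebook cells, and the label of option `7` is not visible in the truncated output, so it is reproduced here without interpretation.

```python
def build_menu_answers(molids, n_accept=0):
    """Assemble the newline-separated answers piped to the script's menus."""
    # '2' shows unassigned conversions, '5' goes entry by entry (menu labels above).
    answers = ["2", "5"]
    # One MolID answer per entity, in the order the script asks for them.
    answers += list(molids)
    # '7' is copied verbatim from the notebook cell (label not shown in the output);
    # '2' then shows only the unaccepted concatenations.
    answers += ["7", "2"]
    # Each accepted concatenation needs option 6 followed by option 2.
    answers += ["6", "2"] * n_accept
    answers.append("QUIT")
    return "\n".join(answers) + "\n"


MOLIDS = ["A", "D", "D", "D", "C,C", "A,B", "D", "D,D"]

answers = build_menu_answers(MOLIDS, n_accept=3)
print(repr(answers))
```

Building the string in Python and passing it through `subprocess.run(..., input=answers, text=True)` avoids relying on how the shell's `echo` treats `\n`, which differs between shells.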


In [8]:
! echo "2\n5\nA\nD\nD\nD\nC,C\nA,B\nD\nD,D\n7\n2\n6\n2\n6\n2\n6\n2\nQUIT\n" | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub2 $PROJDIR/standard_MolID_bank2

Reading: ./TIM/simple_bank_sub2/2y62+00.cif  (1 of 6)
Reading: ./TIM/simple_bank_sub2/1ag1+00.cif  (2 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+04.cif  (3 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+02.cif  (4 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+03.cif  (5 of 6)
Reading: ./TIM/simple_bank_sub2/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
          
Option Number: 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
1:SN-GLYCEROL-3-PHOSPHATE:
1:SN-GLYCEROL-1-PHOSPHATE:
1:GLYCEROL:
2:WATER:
2:TRIOSEPHOSPHATE ISOMERASE:
1:PHOSPHATE ION:
2:2-PHOSPHOGLYCOLIC ACID:
PDBClean MolID Conversion Build Menu
             Selec