# Assign MolID to the entities found in the CIF files (3) 

# What is the goal of this notebook?

This is a continuation of `Assign MolID to the entities found in the CIF files (2)`.
In this notebook we show how to perform concatenations and conversions when you supply your own input file. We also cover two 'what if' situations: entities with missing chain ID assignments and entities that share the same chain ID.

While the script is running, the user has the option to supply their own file containing the chain ID assignments for every entity. This file should be located in the directory that contains all other directories related to their molecule.

**Note:** Make sure to run parts 1 and 2 of this step in advance.

## Using terminal to create file

We use the terminal to create the file used in this tutorial. While this tutorial uses the Vim editor, the user is welcome to use any text editor of their choice. Ensure that the file is saved with the '.txt' extension and follows this format:

### File format:
- **Entity Name:Chain ID assignment**

Each entity name and its corresponding chain ID should be on a separate line. The file should contain no whitespace, except within an entity name that consists of more than one word. Some entity names contain special characters and numbers, so the user should copy each entity name exactly as it is written in the CIF file.
If an entity has more than one chain ID assignment, separate the chain IDs with commas.

 - Example:
    - WATER:A,B
    - PHOSPHATE ION:C
    - GLYCEROL:D,E
    - SN-GLYCEROL-3-PHOSPHATE:F
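
To make the format concrete, here is a minimal sketch of how such a file could be read in Python. The helper `parse_assignments` is hypothetical and is not part of PDBClean. It assumes the chain IDs follow the last colon on each line, so a colon inside an entity name would be kept as part of the name; the actual script may parse the file differently.

```python
def parse_assignments(text):
    """Parse 'Entity Name:ID1,ID2' lines into {entity name: [chain IDs]}.

    Hypothetical helper for illustration only (not part of PDBClean).
    """
    assignments = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: chain IDs come after the LAST colon on the line
        name, sep, ids = line.rpartition(":")
        if not sep:
            raise ValueError(f"No ':' separator found in line: {line!r}")
        # An empty right-hand side (e.g. 'WATER:') yields an empty list
        assignments[name] = [chain_id for chain_id in ids.split(",") if chain_id]
    return assignments


example = """WATER:A,B
PHOSPHATE ION:C
GLYCEROL:D,E
SN-GLYCEROL-3-PHOSPHATE:F"""

parsed = parse_assignments(example)
for entity, chain_ids in parsed.items():
    print(entity, "->", chain_ids)
```

Splitting on the last colon rather than the first is a design choice in this sketch: chain IDs never contain a colon, while entity names copied from a CIF file might.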

In [1]:
#Importing our library

In [2]:
from PDBClean import pdbclean_io

In [3]:
PROJDIR = "./TIM/"
input_file_path = PROJDIR + "TIM_input_file.txt"

## Creating 'TIM_input_file.txt'

This notebook uses 'TIM_input_file.txt', a file containing the chain ID assignments we created in notebook 2.1 and stored in our 'standard_MolID_bank' directory. The code block below creates this '.txt' file for the user. As mentioned previously, the file is saved within the directory created in step 0 for the molecule; in our case, this directory is 'TIM'. The file can have any name the user wishes.

As a reminder, the table below shows the chain ID assignments we performed in step 2.1:

| New chain ID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| B | 1:SN-GLYCEROL-3-PHOSPHATE: |
| C | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D,E | 2:GLYCEROL: |
| F,G | 2:WATER: |
| H,I | 2:TRIOSEPHOSPHATE ISOMERASE: |
| J | 1:PHOSPHATE ION: |
| K,L | 2:2-PHOSPHOGLYCOLIC ACID: | 


In [4]:
# Creating the file for the user
text = """
    TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
    SN-GLYCEROL-3-PHOSPHATE:B
    SN-GLYCEROL-1-PHOSPHATE:C
    GLYCEROL:D,E
    WATER:F,G
    TRIOSEPHOSPHATE ISOMERASE:H,I
    PHOSPHATE ION:J
    2-PHOSPHOGLYCOLIC ACID:K,L
    """.strip()

text = "\n".join(line.strip() for line in text.splitlines())

with open(input_file_path, 'w') as file:
    file.write(text)

In [5]:
# Run the code block below to see what 'TIM_input_file.txt' looks like

In [6]:
! cat $input_file_path

TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
SN-GLYCEROL-3-PHOSPHATE:B
SN-GLYCEROL-1-PHOSPHATE:C
GLYCEROL:D,E
WATER:F,G
TRIOSEPHOSPHATE ISOMERASE:H,I
PHOSPHATE ION:J
2-PHOSPHOGLYCOLIC ACID:K,L

## Running PDBClean_MolID_CIF.py 

Remember that the script is run in the terminal as follows:

> PDBClean_MolID_CIF.py `{Input Directory}` `{Output Directory}`

* The input directory: Directory containing the structures generated in Step 1 
* The output directory: Directory where the new structures will be stored. 


**Note:** We recommend running the script directly in the terminal. We run it from the notebook only for demonstration purposes.

### Workflow

When running the script, the user will be presented with the same Conversion Build Menu shown in the previous notebooks for step 2. The user selects option `3) Enter input file` and enters the name of their file. Since our chain ID assignments are complete and have no conflicts, we are then presented with option `7) Continue to next step of curation`, followed by option `6) Finalize Curation`, which completes the curation.

Summary:

`3) Enter input file` -> `Conversion File: 'TIM_input_file.txt'` -> `7) Continue to next step of curation` -> `6) Finalize Curation`
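
Because the menu reads one answer per prompt, the sequence above can be written as newline-separated text and piped into the script. The sketch below only builds that text; `build_menu_input` is a hypothetical helper, and the file path is the one used in this notebook.

```python
def build_menu_input(answers):
    """Join menu answers into newline-terminated text, one answer per prompt.

    Hypothetical helper for illustration only (not part of PDBClean).
    """
    return "".join(f"{answer}\n" for answer in answers)


# 3 = Enter input file, then the file path,
# 7 = Continue to next step of curation, 6 = Finalize Curation
menu_input = build_menu_input(["3", "./TIM/TIM_input_file.txt", "7", "6"])
print(repr(menu_input))

# One way to feed this to the script from Python (not executed here) would be:
#   subprocess.run(
#       ["PDBClean_MolID_CIF.py", "./TIM/simple_bank_sub", "./TIM/standard_MolID_bank"],
#       input=menu_input, text=True,
#   )
```

Building the answers in Python avoids relying on how the shell's `echo` treats `\n`, which differs between shells.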

In [7]:
#Running the command

! echo '3\n'$input_file_path'\n7\n6\n' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

Reading: ./TIM//simple_bank_sub/2y62+00.cif  (1 of 6)
Reading: ./TIM//simple_bank_sub/1ag1+00.cif  (2 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+04.cif  (3 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+02.cif  (4 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+03.cif  (5 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
             A) Print entity:file_name list
             B) TEST TRACKING CHAIN-NAME:ENTITY:FILE-NAME
             C) Yet another test. Tracking, similar to B but print only relevant chain names...
             D) Adding original chain names 0_0
          
Option Number: Conversion File: Congratu

## Missing chain ID assignments
What if some entities had missing chain ID assignments? The table below shows what the file would contain in that case.

| New chain ID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| B | 1:SN-GLYCEROL-3-PHOSPHATE: |
| C | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D,E | 2:GLYCEROL: |
|  | 2:WATER: |
| A | 2:TRIOSEPHOSPHATE ISOMERASE: |
| J | 1:PHOSPHATE ION: |
| K,L | 2:2-PHOSPHOGLYCOLIC ACID: | 
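
The gaps in this table can also be spotted before running the script. The sketch below assumes that the number in front of each entity name (the `2` in `2:WATER:`) is the number of chain IDs that entity needs. `find_incomplete` is a hypothetical helper, not a PDBClean function.

```python
# Chain IDs needed per entity, copied from the 'N:ENTITY:' column of the table
# (assumption: the leading number is the required count)
required = {
    "TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM": 1,
    "SN-GLYCEROL-3-PHOSPHATE": 1,
    "SN-GLYCEROL-1-PHOSPHATE": 1,
    "GLYCEROL": 2,
    "WATER": 2,
    "TRIOSEPHOSPHATE ISOMERASE": 2,
    "PHOSPHATE ION": 1,
    "2-PHOSPHOGLYCOLIC ACID": 2,
}

# Chain IDs currently assigned, copied from the 'New chain ID' column
assigned = {
    "TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM": ["A"],
    "SN-GLYCEROL-3-PHOSPHATE": ["B"],
    "SN-GLYCEROL-1-PHOSPHATE": ["C"],
    "GLYCEROL": ["D", "E"],
    "WATER": [],
    "TRIOSEPHOSPHATE ISOMERASE": ["A"],
    "PHOSPHATE ION": ["J"],
    "2-PHOSPHOGLYCOLIC ACID": ["K", "L"],
}


def find_incomplete(required, assigned):
    """Return {entity: number of chain IDs still missing}."""
    incomplete = {}
    for entity, needed in required.items():
        have = len(assigned.get(entity, []))
        if have < needed:
            incomplete[entity] = needed - have
    return incomplete


incomplete = find_incomplete(required, assigned)
for entity, missing in incomplete.items():
    print(f"{entity}: {missing} chain ID(s) missing")
```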

### Workflow
When running the script, the user will not receive the 'ready to curate' message because of the missing chain ID assignments. To fix this, the user will:

1. **Enter the file**:
   - Select option `3) Enter input file`, which prompts them to enter the name of the file.
2. **View chain ID assignments** by selecting either option:
   - `1) Show full conversion`: Displays all the entities and their chain ID assignments in the 'standard_MolID_bank' directory.
   - `2) Show only unassigned conversions`: Displays only the entities with missing chain ID assignments.
3. **Search and assign chain IDs**:
   - Select option `4) Search MolID to add chain ID conversion`. This option works as a search function and requires the user to enter the name of the entity they wish to assign a chain ID to.
   - If more than one entity contains this name, the program lists them and allows the user to:
     - `1) Further narrow down search results`: Prompts the user to re-enter the entity name more specifically.
     - `2) Add chain ID to conversion templates`: Prompts the user to go straight to adding the chain ID assignment.
4. **Ready to curate**:
   - Once every entity has the right number of chain IDs assigned and no two entities share a chain ID, the user is presented with the same 'ready to curate' message as in the previous step.

Summary:

`3) Enter input file` -> `Conversion File: 'TIM_input_file.txt'` -> `2) Show only unassigned conversions` -> `4) Search MolID to add chain ID conversion` -> `MolID search term: (entity with missing chain ID)` -> `2) Add chain ID to conversion templates` -> `Enter new chain IDs, comma separated, no spaces Chain IDs:(chain ID assignment of choice)` -> `7) Continue to next step of curation` -> `6) Finalize Curation`

In [8]:
text = """
    TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
    SN-GLYCEROL-3-PHOSPHATE:B
    SN-GLYCEROL-1-PHOSPHATE:C
    GLYCEROL:D,E
    WATER:
    TRIOSEPHOSPHATE ISOMERASE:A
    PHOSPHATE ION:J
    2-PHOSPHOGLYCOLIC ACID:K,L
    """.strip()

text = "\n".join(line.strip() for line in text.splitlines())

with open(input_file_path, 'w') as file:
    file.write(text)

In [9]:
# Run the code block below to see what 'TIM_input_file.txt' looks like

In [10]:
! cat $input_file_path

TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
SN-GLYCEROL-3-PHOSPHATE:B
SN-GLYCEROL-1-PHOSPHATE:C
GLYCEROL:D,E
WATER:
TRIOSEPHOSPHATE ISOMERASE:A
PHOSPHATE ION:J
2-PHOSPHOGLYCOLIC ACID:K,L

In [11]:
! echo '3\n'$input_file_path'\n2\n4\nWATER\n2\nF,G\n4\nTRIOSEPHOSPHATE ISOMERASE\n2\nA2\n1\n7\n6\n' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

Reading: ./TIM//simple_bank_sub/2y62+00.cif  (1 of 6)
Reading: ./TIM//simple_bank_sub/1ag1+00.cif  (2 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+04.cif  (3 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+02.cif  (4 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+03.cif  (5 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
             A) Print entity:file_name list
             B) TEST TRACKING CHAIN-NAME:ENTITY:FILE-NAME
             C) Yet another test. Tracking, similar to B but print only relevant chain names...
             D) Adding original chain names 0_0
          
Option Number: Conversion File: PDBClean

## Same chain ID assignments

Now, what if there are some entities with the same chain ID assignments? Below is a table of what this file would look like.

| New chain ID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| D | 1:SN-GLYCEROL-3-PHOSPHATE: |
| D | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D,D | 2:GLYCEROL: |
| C,C | 2:WATER: |
| A,B | 2:TRIOSEPHOSPHATE ISOMERASE: |
| D | 1:PHOSPHATE ION: |
| D,D | 2:2-PHOSPHOGLYCOLIC ACID: | 
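
To see which chain IDs collide in this table, the assignments can be grouped by chain ID. This is an illustrative sketch only: `find_shared_chain_ids` is a hypothetical helper, and the script's own conflict detection may work differently.

```python
from collections import defaultdict

# Assignments copied from the table above
assigned = {
    "TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM": ["A"],
    "SN-GLYCEROL-3-PHOSPHATE": ["D"],
    "SN-GLYCEROL-1-PHOSPHATE": ["D"],
    "GLYCEROL": ["D", "D"],
    "WATER": ["C", "C"],
    "TRIOSEPHOSPHATE ISOMERASE": ["A", "B"],
    "PHOSPHATE ION": ["D"],
    "2-PHOSPHOGLYCOLIC ACID": ["D", "D"],
}


def find_shared_chain_ids(assigned):
    """Return {chain ID: [entities using it]} for IDs used more than once.

    An entity that repeats an ID (e.g. 'D,D') appears once per use.
    """
    owners = defaultdict(list)
    for entity, chain_ids in assigned.items():
        for chain_id in chain_ids:
            owners[chain_id].append(entity)
    return {cid: ents for cid, ents in owners.items() if len(ents) > 1}


shared = find_shared_chain_ids(assigned)
for chain_id in sorted(shared):
    print(chain_id, "is used", len(shared[chain_id]), "times")
```

Here `A`, `C`, and `D` are each used more than once, while `B` is unique.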

In [12]:
# Creating the file with the same chain ID assignments
text = """
    TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
    SN-GLYCEROL-3-PHOSPHATE:D
    SN-GLYCEROL-1-PHOSPHATE:D
    GLYCEROL:D,D
    WATER:C,C
    TRIOSEPHOSPHATE ISOMERASE:A,B
    PHOSPHATE ION:D
    2-PHOSPHOGLYCOLIC ACID:D,D
    """.strip()

text = "\n".join(line.strip() for line in text.splitlines())

with open(input_file_path, 'w') as file:
    file.write(text)

In [13]:
# Run the code block below to see what 'TIM_input_file.txt' looks like

In [14]:
! cat $input_file_path

TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
SN-GLYCEROL-3-PHOSPHATE:D
SN-GLYCEROL-1-PHOSPHATE:D
GLYCEROL:D,D
WATER:C,C
TRIOSEPHOSPHATE ISOMERASE:A,B
PHOSPHATE ION:D
2-PHOSPHOGLYCOLIC ACID:D,D

### Workflow
As in the previous step, the user selects option `3) Enter input file`, which prompts them to enter the name of the file. They are then presented with option `7) Continue to next step of curation`; however, a message informs them that the same chain ID has been assigned to two or more entities. The user can resolve this by selecting either `5) Accept ALL`, which accepts all proposed concatenations as they are, or `4) Accept proposed concatenation one by one`, which lets the user accept each proposed concatenation individually.

Summary:

`3) Enter input file` -> `Type in the name of your file 'TIM_input_file.txt'` -> `7) Continue to next step of curation` -> `5) Accept ALL (BE CAREFUL, make sure you agree with all concatenations)` **OR** `4) Accept proposed concatenation one by one (Repeat this step until finalizing option appears)` -> `6) Finalize Curation`

In [15]:
! echo '3\n'$input_file_path'\n7\n5\n6\n' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

Reading: ./TIM//simple_bank_sub/2y62+00.cif  (1 of 6)
Reading: ./TIM//simple_bank_sub/1ag1+00.cif  (2 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+04.cif  (3 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+02.cif  (4 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+03.cif  (5 of 6)
Reading: ./TIM//simple_bank_sub/1aw1+01.cif  (6 of 6)
PDBClean MolID Conversion Build Menu
             Select one of the following options to proceed:
             1) Show full conversion
             2) Show only unassigned conversions
             3) Enter input file
             4) Search MolID to add chain ID conversion
             5) Go entry by entry to add chain ID conversion
             6) Remove a chain ID conversion
             A) Print entity:file_name list
             B) TEST TRACKING CHAIN-NAME:ENTITY:FILE-NAME
             C) Yet another test. Tracking, similar to B but print only relevant chain names...
             D) Adding original chain names 0_0
          
Option Number: Conversion File: Congratu