# Assign MolID to the entities found in the CIF files (3) 

# What is the goal of this notebook?

This is a continuation of `Assign MolID to the entities found in the CIF files (2)`.
In this notebook we will show how to perform concatenations and conversions if you wish to supply your own file. We will also cover what happens when chain ID assignments are missing, or when the same chain ID is assigned to more than one entity.

When the script is run, the user has the option to supply their own file containing the chain ID assignments for every entity. This file should be located in the project directory for their molecule, alongside all the other directories related to it.

**Note:** Make sure to run part 1 and 2 of this step in advance.

In [3]:
#Importing our library

In [4]:
from PDBClean import pdbclean_io

In [5]:
PROJDIR = "./TIM/"
input_file_path = PROJDIR + "TIM_input_file.txt"

## Creating 'TIM_input_file.txt'

This notebook uses 'TIM_input_file.txt', which contains the chain ID assignments we created in notebook 2.1 and stored in our 'standard_MolID_bank' directory. The code block below creates this text file for the user. As mentioned previously, the file is saved in the directory created for your molecule in step 0. The file can have any name the user wishes.

As a reminder, the table below shows the chain ID assignments we performed in step 2.1.

| New chain ID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| B | 1:SN-GLYCEROL-3-PHOSPHATE: |
| C | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D,E | 2:GLYCEROL: |
| F,G | 2:WATER: |
| H,I | 2:TRIOSEPHOSPHATE ISOMERASE: |
| J | 1:PHOSPHATE ION: |
| K,L | 2:2-PHOSPHOGLYCOLIC ACID: | 


In [7]:
# Creating the file for the user
text = """
    TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
    SN-GLYCEROL-3-PHOSPHATE:B
    SN-GLYCEROL-1-PHOSPHATE:C
    GLYCEROL:D,E
    WATER:F,G
    TRIOSEPHOSPHATE ISOMERASE:H,I
    PHOSPHATE ION:J
    2-PHOSPHOGLYCOLIC ACID:K,L
    """.strip()

text = "\n".join(line.strip() for line in text.splitlines())

with open('TIM_input_file.txt', 'w') as file:
    file.write(text)

In [8]:
# To move it to the right directory (for me)
! mv TIM_input_file.txt ~/Internship/PDBClean-0.0.2/Notebooks/TIM
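
The create-then-move steps above depend on a path that is specific to one machine. As a minimal sketch, the file could instead be written straight into the project directory. The helper name `write_input_file` is hypothetical and is not part of PDBClean; it only assumes the `ENTITY:ID,ID` line format shown in this notebook.

```python
from pathlib import Path


def write_input_file(project_dir, filename, assignments):
    """Write 'ENTITY:ID,ID' lines to project_dir/filename and return the path.

    Hypothetical helper, not part of PDBClean. `assignments` maps each
    entity name to a list of chain IDs (an empty list leaves it unassigned).
    """
    path = Path(project_dir) / filename
    # Create the project directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [
        f"{entity}:{','.join(chain_ids)}"
        for entity, chain_ids in assignments.items()
    ]
    path.write_text("\n".join(lines))
    return path
```

For example, `write_input_file(PROJDIR, "TIM_input_file.txt", {"GLYCEROL": ["D", "E"], "WATER": ["F", "G"]})` would place the file next to the other TIM directories without a separate `mv` step.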

In [9]:
# Let's ensure that everything has been saved accordingly
! cat $input_file_path

TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
SN-GLYCEROL-3-PHOSPHATE:B
SN-GLYCEROL-1-PHOSPHATE:C
GLYCEROL:D,E
WATER:F,G
TRIOSEPHOSPHATE ISOMERASE:H,I
PHOSPHATE ION:J
2-PHOSPHOGLYCOLIC ACID:K,L
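
Besides printing the file, it can help to read it back into a dictionary and check it against the table. The sketch below assumes each line has the form `ENTITY NAME:ID,ID`, as in the output above. The function `parse_assignments` is a hypothetical helper written for this notebook; it does not reproduce how PDBClean itself reads the file.

```python
def parse_assignments(text):
    """Map each entity name to its list of chain IDs.

    Hypothetical helper, not part of PDBClean.
    """
    assignments = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the LAST colon, since entity names can contain
        # commas and spaces (e.g. '... ISOMERASE, TIM')
        entity, sep, ids = line.rpartition(":")
        if not sep:
            raise ValueError(f"Missing ':' in line: {line!r}")
        assignments[entity] = [c.strip() for c in ids.split(",") if c.strip()]
    return assignments
```

Calling `parse_assignments(open(input_file_path).read())` should return one key per entity, with `GLYCEROL` mapped to `['D', 'E']` for the file above.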

## Running PDBClean_MolID_CIF.py 

Remember that the script is run from the terminal as follows:

> PDBClean_MolID_CIF.py `{Input Directory}` `{Output Directory}`

* The input directory: Directory containing the structures generated in Step 1 
* The output directory: Directory where the new structures will be stored. 


**Note:** We recommend running the script directly in the terminal. We are running it from the notebook just for demonstration purposes.

### Workflow

When running the script, the user will be presented with the same Conversion Build Menu shown in the previous parts of step 2. The user selects option `3) Enter input file` and enters the name of their file. Since our chain ID assignments are complete and have no conflicts, we are then presented with option `7) Continue to next step of curation`, followed by option `6) Finalize Curation`, which completes the process.

Summary:

`3) Enter input file` -> `Conversion File: 'TIM_input_file.txt'` -> `7) Continue to next step of curation` -> `6) Finalize Curation`
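
The menu selections can also be sent to the script from Python instead of through a shell pipe. The following is a rough sketch using only the standard library. The helper names `build_menu_input` and `run_molid_script` are hypothetical, and the sketch assumes the script reads its menu answers line by line from standard input. It also checks that the script is on the `PATH` before running it.

```python
import shutil
import subprocess


def build_menu_input(answers):
    """Join menu answers into newline-terminated text for stdin."""
    return "\n".join(answers) + "\n"


def run_molid_script(input_dir, output_dir, answers,
                     script="PDBClean_MolID_CIF.py"):
    """Run the script with the given menu answers (hypothetical wrapper)."""
    # Fail early with a clear message if the script cannot be found,
    # e.g. because the PDBClean environment is not activated
    if shutil.which(script) is None:
        raise FileNotFoundError(f"{script} was not found on PATH")
    return subprocess.run(
        [script, input_dir, output_dir],
        input=build_menu_input(answers),
        text=True,
        check=True,
    )
```

A call such as `run_molid_script("./TIM/simple_bank_sub", "./TIM/standard_MolID_bank", ["3", "TIM_input_file.txt", "1", "7", "6"])` would send the same answers as the piped `echo` command used in this notebook.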

In [12]:
# Running the command (the file name must match the file created above)

! echo '3\nTIM_input_file.txt\n1\n7\n6' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

zsh:1: command not found: PDBClean_MolID_CIF.py

## Missing chain ID assignments

Now, what if some entities have no chain ID assigned? The file created below leaves the first three entities without a chain ID.


In [13]:
text = """
    TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:
    SN-GLYCEROL-3-PHOSPHATE:
    SN-GLYCEROL-1-PHOSPHATE:
    GLYCEROL:D,E
    WATER:F,G
    TRIOSEPHOSPHATE ISOMERASE:H,I
    PHOSPHATE ION:J
    2-PHOSPHOGLYCOLIC ACID:K,L
    """.strip()

text = "\n".join(line.strip() for line in text.splitlines())

with open('TIM_input_file.txt', 'w') as file:
    file.write(text)

In [14]:
# To move it to the right directory (for me)
! mv TIM_input_file.txt ~/Internship/PDBClean-0.0.2/Notebooks/TIM

In [15]:
! echo '3\nTIM_input_file.txt' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

zsh:1: command not found: PDBClean_MolID_CIF.py


Because some entities are missing a chain ID, you are not greeted with the 'ready to curate' message. Instead, you are presented with the normal menu. Select `1)` to show everything, then `2)` to show which entities are unassigned. Once you know which entities are unassigned, select `4)`, which lets you search for an entity by name and add the chain ID of your choice. Here you can select `1` to add a specific chain ID to each 'type', or `2` to assign the same chain ID to all of them.
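
Before starting the script, the unassigned entities can also be listed directly from the input file. This is a small sketch that assumes an unassigned entity is a line ending in `:` with nothing after it, as in the file created above. `find_unassigned` is a hypothetical helper and not a PDBClean function.

```python
def find_unassigned(text):
    """Return the entity names that have no chain ID after the colon.

    Hypothetical helper, not part of PDBClean.
    """
    unassigned = []
    for line in text.splitlines():
        # Split on the last colon; blank lines produce an empty separator
        entity, sep, ids = line.strip().rpartition(":")
        if sep and not ids.strip():
            unassigned.append(entity)
    return unassigned
```

Run on the file above, this should report the first three entities, which are the ones to fix through option `4)`.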

## Same chain ID assignments

Now, what if there are some entities with the same chain ID assignments? Below is a table of what this file would look like.

| New chain ID | ENTITIES |
|---|:---|
| A | 1:TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM: |
| D | 1:SN-GLYCEROL-3-PHOSPHATE: |
| D | 1:SN-GLYCEROL-1-PHOSPHATE: |
| D,D | 2:GLYCEROL: |
| C,C | 2:WATER: |
| A,B | 2:TRIOSEPHOSPHATE ISOMERASE: |
| D | 1:PHOSPHATE ION: |
| D,D | 2:2-PHOSPHOGLYCOLIC ACID: | 

In [18]:
# Creating the file with the same chain ID assignments
text = """
    TRIOSEPHOSPHATE ISOMERASE SYNONYM TRIOSE-PHOSPHATE ISOMERASE, TIM:A
    SN-GLYCEROL-3-PHOSPHATE:D
    SN-GLYCEROL-1-PHOSPHATE:D
    GLYCEROL:D,D
    WATER:C,C
    TRIOSEPHOSPHATE ISOMERASE:A,B
    PHOSPHATE ION:D
    2-PHOSPHOGLYCOLIC ACID:D,D
    """.strip()

text = "\n".join(line.strip() for line in text.splitlines())

with open('TIM_input_file.txt', 'w') as file:
    file.write(text)

In [19]:
#To move it to the right directory (for me)
! mv TIM_input_file.txt ~/Internship/PDBClean-0.0.2/Notebooks/TIM

### Workflow
As in the previous step, the user selects option `3) Enter input file`, which prompts them to enter the name of the file. They are then presented with option `7) Continue to next step of curation`. This time, however, a message informs them that they have assigned the same chain ID to two or more entities. The user can resolve this either by selecting `5) Accept ALL`, which accepts all proposed concatenations as they are, or by selecting `4) Accept proposed concatenation one by one`, which lets the user review and accept each concatenation individually.

Summary:

`3) Enter input file` -> `Type in the name of your file 'TIM_input_file.txt'` -> `7) Continue to next step of curation` -> `5) Accept ALL (BE CAREFUL, make sure you agree with all concatenations)` **OR** `4) Accept proposed concatenation one by one (Repeat this step until finalizing option appears)` -> `6) Finalize Curation`
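
To see in advance which chain IDs will trigger the concatenation message, the input file can be scanned for IDs used by more than one entity. The sketch below is an illustration only: `find_shared_chain_ids` is a hypothetical helper, and it assumes a conflict means one chain ID appearing under two or more different entities. A repeated ID within a single entity, such as `WATER:C,C`, is not counted.

```python
from collections import defaultdict


def find_shared_chain_ids(text):
    """Map each chain ID used by 2+ entities to the sorted entity names.

    Hypothetical helper, not part of PDBClean.
    """
    owners = defaultdict(set)
    for line in text.splitlines():
        entity, sep, ids = line.strip().rpartition(":")
        if not sep:
            continue  # skip blank or malformed lines
        for chain_id in ids.split(","):
            chain_id = chain_id.strip()
            if chain_id:
                owners[chain_id].add(entity)
    # Keep only the chain IDs that belong to more than one entity
    return {cid: sorted(ents) for cid, ents in owners.items() if len(ents) > 1}
```

For the file created in this section, the shared IDs would be `A` and `D`, which are the assignments to review when choosing between options `4)` and `5)`.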

In [21]:
! echo '3\nTIM_input_file.txt\n1\n7\n5\n6' | PDBClean_MolID_CIF.py $PROJDIR/simple_bank_sub $PROJDIR/standard_MolID_bank

zsh:1: command not found: PDBClean_MolID_CIF.py
