# Split PDB files into chains using command line

This is an alternative to the `pdbsplit` approach illustrated [here](https://nbviewer.jupyter.org/github/fomightez/bio3d-binder/blob/master/index.ipynb); that demonstration notebook can be launched in active form by clicking any `launch binder` badge [here](https://github.com/fomightez/bio3d-binder).

If you'd prefer a graphical environment, PyMOL has a way to do this, as described [here](https://sourceforge.net/p/pymol/mailman/message/30683050/), and a script is also available (see the bottom of that email thread).

-----

## Preparation

Get files to use in the demonstration. The first one contains both protein and RNA, so it will be a good test of generality.

In [1]:
!curl -OL https://files.rcsb.org/download/6AH3.pdb.gz
!gunzip 6AH3.pdb.gz
!curl -OL https://files.rcsb.org/download/1l0l.pdb.gz
!gunzip 1l0l.pdb.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  519k  100  519k    0     0   399k      0  0:00:01  0:00:01 --:--:--  399k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  351k  100  351k    0     0   947k      0 --:--:-- --:--:-- --:--:--  947k


------



## Bash/sed method

The basics come from [here](https://sourceforge.net/p/pymol/mailman/message/30683050/); see the 'Sed method' section below for the basics of the `sed` part:

In [2]:
%%bash
pdb=6AH3.pdb
for chain in $(grep "^ATOM" $pdb | cut -b 22 | sort -u); do sed -n "/^.\{21\}$chain/p" $pdb > ${pdb%.pdb}_$chain.pdb; done
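
For readers less familiar with the shell idioms, the two pieces of that one-liner (finding the chain IDs and building the output file name) can be mirrored in Python. This is only an illustrative sketch: the helper names `chain_ids` and `output_name` and the sample lines are made up here and are not part of the original recipe. Note also that `Path.stem` strips whatever the last suffix is, whereas `${pdb%.pdb}` strips only `.pdb`.

```python
from pathlib import Path

def chain_ids(pdb_lines):
    """Mirror `grep "^ATOM" | cut -b 22 | sort -u`: unique chain IDs from ATOM records."""
    return sorted({line[21] for line in pdb_lines
                   if line.startswith("ATOM") and len(line) > 21})

def output_name(pdb_path, chain):
    """Mirror `${pdb%.pdb}_$chain.pdb`: drop the suffix, append the chain ID."""
    path = Path(pdb_path)
    return str(path.with_name(f"{path.stem}_{chain}.pdb"))

# Invented sample records; the chain ID sits in column 22 (index 21)
demo = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  P     G B   1",
    "HETATM    3  O   HOH C 101",  # skipped: the grep only looks at ATOM lines
]
print(chain_ids(demo))
print(output_name("6AH3.pdb", "A"))
```

The `HETATM` line shows a side effect of the `grep "^ATOM"` step: a chain that contains only hetero atoms would not get its own file.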

To take [Tsjerk's advice](https://sourceforge.net/p/pymol/mailman/message/30683050/) and make it a script:

In [3]:
# Raw string so the backslashes in the sed pattern are written to the file unchanged
s = r'''#!/bin/bash
pdb=$1
for chain in $(grep "^ATOM" $pdb | cut -b 22 | sort -u)
do
    sed -n "/^.\{21\}$chain/p" $pdb > ${pdb%.pdb}_$chain.pdb
done'''

%store s > split_into_chains.sh

Writing 's' (str) to file 'split_into_chains.sh'.


Now to use that script.  
(You'd leave out the ! if you were actually running this in a shell terminal.)

In [4]:
!bash split_into_chains.sh 1l0l.pdb

## Sed method

Basics from [here](https://sourceforge.net/p/pymol/mailman/message/30683050/):

In [5]:
%%bash
sed -n "/^.\{21\}A/p" 6AH3.pdb > 6AH3_A.pdb
sed -n "/^.\{21\}B/p" 6AH3.pdb > 6AH3_B.pdb
sed -n "/^.\{21\}C/p" 6AH3.pdb > 6AH3_C.pdb
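
The pattern `/^.\{21\}A/p` prints every line whose 22nd character is `A`, which is the column where the PDB format stores the chain identifier on coordinate records. A rough Python equivalent, given as a sketch with invented sample lines, shows that the match does not look at the record type at all:

```python
import re

# Same idea as sed's /^.\{21\}A/ : any 21 characters, then the chain letter
chain_a = re.compile(r"^.{21}A")

sample = [
    "ATOM      1  N   MET A   1",
    "ATOM      2  P     G B   1",
    "REMARK" + " " * 15 + "A happens to sit in column 22 here",
]
matched = [line for line in sample if chain_a.match(line)]
for line in matched:
    print(line)
```

So any header or remark line that happens to carry the chain letter in column 22 ends up in the output file too. Whether that matters depends on what reads the split files afterwards.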

I thought that, by taking advantage of the Jupyter/Python magics, I'd be able to combine the above steps into a loop more simply than with the Bash method and at least semi-automate the process. However, even though the next cell shows that sending a Python variable to a `sed` command works, I couldn't get it to work for this pattern. The issue seems to be that IPython uses curly braces to mark Python expressions for expansion in `!` commands, while the `sed` pattern already contains curly braces of its own (`\{21\}`). I couldn't fix it despite trying many combinations and the advice from [here](https://stackoverflow.com/questions/50649280/jupyter-shell-assignment-passing-variables-to-sed/50649532#50649532) and on pages linked from it. In any case, the Bash/sed method (see above) automates both reading the actual chain designations and running `sed`. That is fairly streamlined, and only slightly less clear to read than what I would have managed with Python anyway.

In [6]:
#write example FASTA to file  
s = '''>evoli
atctgatctggggcgaaatgagactgatctgatctggtctgtggcg
'''

!echo "{s}" > S288c.mt.genome.fa
#view start
print ("STARTING FROM:")
!head S288c.mt.genome.fa

# So passing this way to sed works
var = "cerevisiae"
!sed -i "1s/.*/>{var}/" S288c.mt.genome.fa

#view result
print ("AFTER PASS VARIABLE FROM PYTHON TO A sed COMMAND, GET:")
!head S288c.mt.genome.fa

STARTING FROM:
>evoli
atctgatctggggcgaaatgagactgatctgatctggtctgtggcg

AFTER PASS VARIABLE FROM PYTHON TO A sed COMMAND, GET:
>cerevisiae
atctgatctggggcgaaatgagactgatctgatctggtctgtggcg



In [7]:
# ALL THESE FAIL, leaving as a note to self of what I wanted to do and tried (also tried double curly braces [not shown] here which I had seen necessary to get regular expression search terms that featured them from command line into python)
'''chains = ["A","B","C","D","E","F","G","H","I","J","K","T"]
-OR-
import string
chains = list(string.ascii_uppercase)[:11]#based on https://stackoverflow.com/questions/16060899/alphabet-range-python/31888217
chains.append("T")
import re
for chain in chains:
    chain_esc = re.escape(chain)
    #!echo $chain > tst.tst
    fn = "6AH3_{}.pdb".format(chain)
    #!echo $chain >$fn
    #!sed -n "/^.\{21\}{chain}/p" 6AH3.pdb > \{fn\}
    var = !sed -n "/^.\{21\}{chain_esc}/p" 6AH3.pdb
    print (var.n[:100])
    #print (var.n[1:100])
    #!echo {var.n} > {fn}
'''
'''
for chain in chains:
    #!echo {chain} > 6AH3_{chain}.pdb # FAILS DUE TO METHOD TO PASS PYTHON VARIABLE AND SED REGEX OVERLAP, it seems.
    #!sed -n "/^.\{21\}\{chain\}/p" 6AH3.pdb > 6AH3_\{chain\}.pdb # FAILS DUE TO METHOD TO PASS PYTHON VARIABLE AND SED REGEX OVERLAP, it seems.
    chain_esc = re.escape(chain) #based on https://stackoverflow.com/a/50649532/8508004
    #!echo {chain_esc} > 6AH3_{chain}.pdb # FAILS DUE TO METHOD TO PASS PYTHON VARIABLE AND SED REGEX OVERLAP, it seems.
    
    #!sed -n "/^.\{21\}\{chain\}/p" 6AH3.pdb > 6AH3_\{chain\}.pdb #seems to work (I think); combines 
    # https://stackoverflow.com/a/16790880/8508004 and https://stackoverflow.com/a/50649532/8508004
    #!sed -n "/^.\{21\}{chain_esc}/p" 6AH3.pdb > 6AH3_{chain_esc}.pdb # FAILS
    !sed -n "/^.\{21\}{\chain\}/p" 6AH3.pdb > 6AH3_{chain}.pdb
''';

Adding to the frustration/puzzlement, I could get Bash to echo `chain` (or a Python list) with:

    !echo $chain
    !echo {chain}
    !FOO=`python -c 'print (" ".join(["A","B","C","D","E","F","G","H","I","J","K","T"]))'`;echo $FOO #based on https://stackoverflow.com/a/11392201/8508004

But I couldn't work out how to pass that to a Bash variable; all of these failed?!?:

    #!sed -n "/^.\{21\}{chain}/p" 6AH3.pdb > \{fn\}
    !python -c 'print (chain_esc)'
    #!bchain=$(echo $chain); echo $bchain
    #!bchain=$(echo {chain}); echo $bchain
    #!bchain=$(echo {chain}); echo $bchain
    #!bchain=$(echo \{chain\}); echo $bchain
    #!bchain=$(echo {{chain}}); echo $bchain
    #!bchain=$(python -c 'print ($chain)'); echo $bchain
    #!bchainecho $chain;echo $bchain
    #!bchain=3;echo $bchain
    #!bchain="{chain}";echo -i $bchain;sed -n "/^.\{21\}$bchain/p" 6AH3.pdb

My idea was that if I could pass it to Bash, I could use the `sed` command from the 'Bash/sed method'. But I couldn't?!?!

(Note that I could have used `%%bash -s "$myPythonVar" "$myOtherVar"` to reliably pass a Python variable into Bash, but that would have locked me into using Bash at that point to make the loop, and so I'd be right back to what the 'Bash/sed method' already handles.)

By the way, the [bottom of this page](https://www.tldp.org/LDP/abs/html/varassignment.html) has good coverage of how the backtick form `` var=`echo chain` `` gave way to the newer `var=$(echo chain)`.
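
One way to sidestep the brace collision entirely, offered as a sketch rather than something tried in the cells above, is to skip the `!` syntax and call `sed` through Python's `subprocess` module. The function name `sed_chain_lines` is hypothetical, and the sketch assumes `sed` is on the `PATH` and that chain IDs are plain letters or digits (nothing special to `sed`'s regex syntax).

```python
import subprocess

def sed_chain_lines(pdb_path, chain):
    """Return the lines of pdb_path whose 22nd character equals `chain`, via sed."""
    # The arguments go to sed as a list, so neither IPython's {} expansion
    # nor a shell ever gets to reinterpret the braces in the pattern.
    pattern = r"/^.\{21\}" + chain + "/p"
    result = subprocess.run(["sed", "-n", pattern, pdb_path],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Usage (assumes 6AH3.pdb is in the working directory, as in the cells above):
# for chain in ["A", "B", "C"]:
#     with open(f"6AH3_{chain}.pdb", "w") as handle:
#         handle.write(sed_chain_lines("6AH3.pdb", chain))
```

Because the loop lives in ordinary Python code, the chain list can come from anywhere in the notebook without any quoting gymnastics.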

------
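I also tried this **AWK** approach from [here](https://bougui505.github.io/2017/03/15/split_a_multi_chain_pdb_into_one_pdb_file_per_chain_using_awk.html), but I couldn't get it to work yet. Note that `awk` needs the `-f` option to read its program from a file; without it, the file name itself is parsed as the program text, which is what produces the `unexpected character '.'` error in the first attempt below.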

In [8]:
!curl -O https://gist.githubusercontent.com/bougui505/e9cae5e9a8b3c3c4a65e699ab0e0a20e/raw/3baf7ad9eb5b0660d1b4c13c5a3fee6750aecfcb/split_chains.awk

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   703  100   703    0     0   3480      0 --:--:-- --:--:-- --:--:--  3480


In [9]:
!awk split_chains.awk 6AH3.pdb

awk: 1: unexpected character '.'


In [10]:
!awk -f split_chains.awk 6AH3.pdb #based on https://stackoverflow.com/questions/13045110/awk-1-unexpected-character-suddenly-appeared

-----

## Collect files for easy downloading

In [11]:
!tar czf 6AH3_chains.tar.gz 6AH3_*.pdb

In [12]:
!tar czf 1l0l_chains.tar.gz 1l0l_*.pdb
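
If `tar` is not available on your system, Python's standard `tarfile` module can build the same kind of archive. A minimal sketch, where `bundle_chains` is a hypothetical helper name and the file naming follows the `<name>_<chain>.pdb` pattern produced above:

```python
import glob
import os
import tarfile

def bundle_chains(prefix, directory="."):
    """Pack every <prefix>_*.pdb in `directory` into <prefix>_chains.tar.gz."""
    archive = os.path.join(directory, f"{prefix}_chains.tar.gz")
    members = sorted(glob.glob(os.path.join(directory, f"{prefix}_*.pdb")))
    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            # arcname drops the directory part so the archive holds bare file names
            tar.add(path, arcname=os.path.basename(path))
    return archive, members

# Usage, mirroring the two tar commands above:
# bundle_chains("6AH3")
# bundle_chains("1l0l")
```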

Download the two gzipped tar archives to your local machine.