# Working with PDBePISA interface lists/reports in Jupyter Basics and filtering to nucleic acid chains

Usually you'll want to get some data from PDBePISA and analyze it. For the current example in this series of notebooks, I'll cover how to bring in a file listing interface details for a macromolecular complex and then progress through using that in combination with Python to analyze the results and ultimately compare the results to a different structure.

-----

<div class="alert alert-block alert-warning">
<p>If you haven't used one of these notebooks before, they're basically web pages in which you can write, edit, and run live code. They're meant to encourage experimentation, so don't feel nervous. Just try running a few cells and see what happens!</p>

<p>
    Some tips:
    <ul>
        <li>Code cells have boxes around them. When you hover over them an <i class="fa-step-forward fa"></i> icon appears.</li>
        <li>To run a code cell either click the <i class="fa-step-forward fa"></i> icon, or click on the cell and then hit <b>Shift+Enter</b>. The <b>Shift+Enter</b> combo will also move you to the next cell, so it's a quick way to work through the notebook.</li>
        <li>While a cell is running a <b>*</b> appears in the square brackets next to the cell. Once the cell has finished running the asterisk will be replaced with a number.</li>
        <li>In most cases you'll want to start from the top of the notebook and work your way down, running each cell in turn. Later cells might depend on the results of earlier ones.</li>
        <li>To edit a code cell, just click on it and type stuff. Remember to run the cell once you've finished editing.</li>
    </ul>
</p>
</div>

----

## Demonstrating the script to make a dataframe from PDBePISA interface lists/reports

### Preparation: Fetch the script.

The script is stored on GitHub, and running the next cell will bring a copy of it to the working directory here.

In [6]:
# Get a file if not yet retrieved / check if file exists
import os
file_needed = "pisa_interface_list_to_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbepisa-utilities/{file_needed}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42992  100 42992    0     0   226k      0 --:--:-- --:--:-- --:--:--  226k


### Using the script as you would on the command line.

In [1]:
%run pisa_interface_list_to_df.py 4fgf

Output()

The script saves the dataframe it produces in a compressed format Python can recognize. We'll read that back in and view it here. (Below, we'll demonstrate using the main function right in the notebook, and then we can skip this step.)

In [2]:
import pandas as pd
df = pd.read_pickle("4fgf_PISAinterface_summary_pickled_df.pkl")
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,SymOp,SymID,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,A,31,9,6453,x,A,"x-1,y,z",1_455,38,15,6453,340.7,-1.4,0.359,4,2,0,0.0
1,2,A,22,6,6453,x,A,"x,y-1,z+1",1_546,34,11,6453,257.9,2.2,0.685,1,3,0,0.0
2,3,A,22,8,6453,x,A,"x,y-1,z",1_545,17,6,6453,167.5,-0.7,0.438,2,0,0,0.0
3,4,A,19,4,6453,x,A,"x,y,z-1",1_554,17,5,6453,146.3,0.0,0.477,2,4,0,0.0
4,5,[BME]A:149,4,1,208,◊,A,"x,y,z",1_555,13,6,6453,107.4,-0.6,0.358,2,0,0,0.073
5,6,[SO4]A:147,5,1,185,f,A,"x,y,z",1_555,18,5,6453,101.4,-14.0,0.853,5,0,0,0.783
6,7,[BME]A:148,4,1,208,◊,A,"x,y,z",1_555,10,4,6453,72.4,-2.5,0.139,1,0,0,0.0
7,8,A,10,4,6453,f,[BME]A:148,"x,y-1,z",1_545,4,1,208,63.1,0.5,0.465,0,0,0,0.0
8,9,A,4,2,6453,x,A,"x-1,y,z+1",1_456,6,3,6453,53.9,1.4,0.794,1,2,0,0.0
9,10,[SO4]A:147,4,1,185,f,[BME]A:148,"x,y-1,z",1_545,4,1,208,37.7,-3.5,0.747,0,0,0,0.167


You can see the results above. You'll need to scroll to the right to see it all unless you have a really wide screen. There are a lot of columns.

**IMPORTANTLY:**  
On your own machine, outside of Jupyter (or IPython), you'd replace `%run` with `python` (or perhaps `python3`, depending on your Python installation). So using the script in a terminal would look something like:

```shell
python pisa_interface_list_to_df.py 4fgf
```

Note: you'd have to have the script placed in that working directory. If it's not already there, you may be able to use the following command to get it:

```shell
curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbepisa-utilities/pisa_interface_list_to_df.py
```

----

### Using the main function imported into Python

Import the main function of the script.

In [3]:
from pisa_interface_list_to_df import pisa_interface_list_to_df

Since it is imported into the current notebook namespace, we can use the main function here, assign the output to a variable, and display the result without needing to read in the intermediate file.

In [4]:
dfb = pisa_interface_list_to_df('1trn')
dfb

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,Id,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 7_level_1,Chain label,SymOp,SymID,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1.0,1.0,A,68.0,19.0,9795.0,x,A,"-y,x,z",3_555,81.0,20.0,9795,726.0,-3.3,0.416,8,1,0,0.0
1,,2.0,B,79.0,19.0,9608.0,x,B,"-y+1,x,z",3_655,65.0,20.0,9608,698.3,-2.2,0.553,10,2,0,0.0
2,,,,,,,,,,,,,**_Average:_**,712.1,-2.7,0.484,9,2,0,0.0
3,2.0,3.0,B,26.0,11.0,9608.0,◊,A,"x,y,z",1_555,29.0,13.0,9795,238.1,-1.8,0.399,2,0,0,0.0
4,3.0,4.0,A,27.0,12.0,9795.0,◊,B,"x,y,z-1",1_554,25.0,8.0,9608,216.4,-3.3,0.13,3,0,0,0.0
5,4.0,5.0,B,26.0,9.0,9608.0,◊,A,"-y+1,x,z+1",3_656,19.0,6.0,9795,185.2,-0.8,0.505,3,1,0,0.0
6,5.0,6.0,[ISP]A:301,7.0,1.0,264.0,cf,A,"x,y,z",1_555,30.0,14.0,9795,173.7,1.5,0.627,3,0,0,0.1
7,,7.0,[ISP]B:301,7.0,1.0,265.0,cf,B,"x,y,z",1_555,29.0,13.0,9608,163.6,1.1,0.586,3,0,0,0.1
8,,,,,,,,,,,,,**_Average:_**,168.6,1.3,0.606,3,0,0,0.1
9,6.0,8.0,B,14.0,8.0,9608.0,◊,A,"-y+1,x,z",3_655,10.0,6.0,9795,103.1,2.1,0.878,2,2,0,0.0


Note that the intermediate file still gets made, so the stored dataframe can be saved, read in, and used elsewhere without needing to run the script again from the very start.
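
To make that save-and-reload round trip concrete, here's a minimal sketch. It uses a tiny stand-in dataframe with made-up column names (not the script's real output) and a temporary directory, so it won't touch any of the files made above. The file name simply mimics the pattern seen earlier.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in dataframe; the column names here are simplified for illustration.
df_demo = pd.DataFrame(
    {"Chain 1": ["A", "A"], "Chain 2": ["A", "A"], "Area": [340.7, 257.9]}
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "demo_PISAinterface_summary_pickled_df.pkl"
    df_demo.to_pickle(pkl_path)         # store the dataframe once
    df_back = pd.read_pickle(pkl_path)  # a later session reads it instead of re-running

# The reloaded dataframe should match the one that was saved.
round_trip_ok = df_back.equals(df_demo)
print(round_trip_ok)
```

Only read pickle files you made yourself or got from a source you trust, since unpickling can execute arbitrary code.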


------

## Using local data from PDBePISA that you obtained from copying text off a page to make a dataframe with the script

The script should also be able to convert text copied by hand from most PDBePISA interface reports (the lists of interactions), since it was originally developed to do that in a nascent form.

This section will illustrate an example of that.

I got the PISA interface list by copying it by hand from the 'Interfaces' page produced when analysing the PDB entry at https://www.ebi.ac.uk/msd-srv/prot_int/cgi-bin/piserver , which you reach by pressing `Launch PDBePISA` at https://www.ebi.ac.uk/pdbe/pisa/pistart.html .

After launching, I entered `6agb` and brought up the interface report by pressing 'Interfaces' near the bottom. I then used the mouse to highlight the table and copied the text.  
Then I pasted that between the sets of triple quotes below, so that running the cell assigns that string to the variable `s`.

In [5]:
s =''' ## 	 Structure 1 	 × 	 Structure 2 	 interface 
 area, Å2 	 ΔiG 
 kcal/mol 	 ΔiG 
 P-value 	 NHB 	 NSB 	 NDS 	 CSS 
 NN 	 «» 	 Range 	 iNat 	 iNres 	 Surface Å2 	 Range 	 iNat 	 iNres 	 Surface Å2 
 1 		B	 726 	 187 	 45544 	 ◊ 	A	 880 	 117 	 67556 	  7353.3 	  -72.7 	 0.996 	 95 	 0 	 0 	 0.000 
 2 		G	 238 	 65 	 9059 	 ◊ 	A	 266 	 32 	 67556 	  2325.0 	  -24.6 	 0.739 	 23 	 0 	 0 	 0.000 
 3 		D	 244 	 65 	 17023 	 ◊ 	J	 222 	 56 	 16869 	  2290.1 	  -20.0 	 0.174 	 18 	 3 	 0 	 0.000 
 4 		C	 207 	 52 	 12572 	 ◊ 	K	 222 	 58 	 10410 	  2114.0 	  -25.4 	 0.157 	 15 	 0 	 0 	 0.000 
 5 		D	 173 	 48 	 17023 	 ◊ 	A	 231 	 40 	 67556 	  1691.5 	  -18.9 	 0.920 	 24 	 0 	 0 	 0.000 
 6 		D	 154 	 39 	 17023 	 ◊ 	K	 151 	 34 	 10410 	  1475.6 	  -13.5 	 0.211 	 13 	 2 	 0 	 0.000 
 7 		E	 123 	 34 	 9604 	 ◊ 	I	 129 	 30 	 13215 	  1257.1 	  -17.9 	 0.058 	 7 	 0 	 0 	 0.000 
 8 		B	 130 	 39 	 45544 	 ◊ 	G	 126 	 33 	 9059 	  1234.7 	  -10.2 	 0.217 	 10 	 8 	 0 	 0.000 
 9 		E	 106 	 30 	 9604 	 ◊ 	A	 138 	 22 	 67556 	  1186.0 	  -9.8 	 0.916 	 23 	 0 	 0 	 0.000 
 10 		F	 119 	 32 	 9964 	 ◊ 	A	 134 	 23 	 67556 	  1120.8 	  -10.9 	 0.755 	 14 	 0 	 0 	 0.000 
 11 		B	 88 	 21 	 45544 	 ◊ 	E	 123 	 33 	 9604 	  1013.5 	  -16.5 	 0.026 	 7 	 1 	 0 	 0.000 
 12 		B	 119 	 30 	 45544 	 ◊ 	D	 84 	 18 	 17023 	  917.6 	  -11.5 	 0.106 	 4 	 0 	 0 	 0.000 
 13 		H	 94 	 26 	 9292 	 ◊ 	J	 81 	 20 	 16869 	  852.1 	  -7.2 	 0.289 	 9 	 0 	 0 	 0.000 
 14 		H	 81 	 20 	 9292 	 ◊ 	I	 96 	 25 	 13215 	  841.1 	  -8.9 	 0.214 	 4 	 4 	 0 	 0.000 
 15 		F	 91 	 27 	 9964 	 ◊ 	G	 94 	 26 	 9059 	  835.7 	  -9.9 	 0.100 	 13 	 0 	 0 	 0.000 
 16 		K	 93 	 27 	 10410 	 ◊ 	A	 93 	 17 	 67556 	  775.2 	  -9.1 	 0.750 	 5 	 0 	 0 	 0.000 
 17 		E	 69 	 17 	 9604 	 ◊ 	J	 62 	 15 	 16869 	  619.6 	  0.1 	 0.662 	 8 	 4 	 0 	 0.000 
 18 		I	 56 	 17 	 13215 	 ◊ 	A	 54 	 10 	 67556 	  475.6 	  -1.9 	 0.778 	 6 	 0 	 0 	 0.000 
 19 		F	 28 	 9 	 9964 	 ◊ 	I	 33 	 14 	 13215 	  316.5 	  -2.0 	 0.361 	 2 	 2 	 0 	 0.000 
 20 		J	 30 	 9 	 16869 	 ◊ 	A	 34 	 12 	 67556 	  307.0 	  -4.7 	 0.449 	 7 	 0 	 0 	 0.000 
 21 		H	 30 	 7 	 9292 	 ◊ 	A	 41 	 9 	 67556 	  304.2 	  -6.2 	 0.537 	 6 	 0 	 0 	 0.000 
 22 		E	 35 	 12 	 9604 	 ◊ 	H	 35 	 10 	 9292 	  283.0 	  -4.2 	 0.275 	 4 	 0 	 0 	 0.000 
 23 		C	 22 	 6 	 12572 	 ◊ 	D	 26 	 7 	 17023 	  226.6 	  -3.6 	 0.280 	 3 	 0 	 0 	 0.000 
 24 		B	 22 	 6 	 45544 	 ◊ 	I	 25 	 7 	 13215 	  207.1 	  -3.1 	 0.245 	 1 	 0 	 0 	 0.000 
 25 		C	 17 	 8 	 12572 	 ◊ 	A	 28 	 5 	 67556 	  190.7 	  -2.2 	 0.677 	 1 	 0 	 0 	 0.000 
 26 		B	 19 	 6 	 45544 	 ◊ 	F	 19 	 4 	 9964 	  171.3 	  -0.8 	 0.428 	 3 	 0 	 0 	 0.000 
 27 		E	 18 	 5 	 9604 	 ◊ 	G	 20 	 6 	 9059 	  155.9 	  -1.8 	 0.279 	 5 	 0 	 0 	 0.000 
 28 		K	 2 	 2 	 10410 	 ◊ 	[ZN]K:201	 1 	 1 	 98 	  49.0 	  -39.1 	 0.000 	 0 	 0 	 0 	 0.000 
 29 		E	 1 	 1 	 9604 	 ◊ 	F	 1 	 1 	 9964 	  7.7 	  0.2 	 0.797 	 0 	 0 	 0 	 0.000'''

Next, we use the Jupyter magic command `%store` to save the string as a file.

In [6]:
%store s >"6agb_interface_list.txt"

Writing 's' (str) to file '6agb_interface_list.txt'.
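
The `%store` magic only exists inside IPython/Jupyter. If you're working in a plain Python script, something like the following sketch should do a similar job. The short `s_demo` string is a stand-in I made up for the pasted table text, and the temporary directory is only there so the example doesn't overwrite anything.

```python
import tempfile
from pathlib import Path

# Stand-in for the pasted table text; in the notebook this would be `s`.
s_demo = " 1 \t\tB\t 726 \t 187 \n 2 \t\tG\t 238 \t 65 "

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "6agb_interface_list.txt"
    # Write as UTF-8 so characters such as Å and ◊ are stored consistently.
    out_path.write_text(s_demo, encoding="utf-8")
    saved = out_path.read_text(encoding="utf-8")

print(saved == s_demo)
```

In your own working directory, that would come down to `Path("6agb_interface_list.txt").write_text(s, encoding="utf-8")`.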


We can see the content assigned has been saved as a file by running the next cell to see the first few lines.

In [7]:
!head 6agb_interface_list.txt

 ## 	 Structure 1 	 × 	 Structure 2 	 interface 
 area, Å2 	 ΔiG 
 kcal/mol 	 ΔiG 
 P-value 	 NHB 	 NSB 	 NDS 	 CSS 
 NN 	 «» 	 Range 	 iNat 	 iNres 	 Surface Å2 	 Range 	 iNat 	 iNres 	 Surface Å2 
 1 		B	 726 	 187 	 45544 	 ◊ 	A	 880 	 117 	 67556 	  7353.3 	  -72.7 	 0.996 	 95 	 0 	 0 	 0.000 
 2 		G	 238 	 65 	 9059 	 ◊ 	A	 266 	 32 	 67556 	  2325.0 	  -24.6 	 0.739 	 23 	 0 	 0 	 0.000 
 3 		D	 244 	 65 	 17023 	 ◊ 	J	 222 	 56 	 16869 	  2290.1 	  -20.0 	 0.174 	 18 	 3 	 0 	 0.000 
 4 		C	 207 	 52 	 12572 	 ◊ 	K	 222 	 58 	 10410 	  2114.0 	  -25.4 	 0.157 	 15 	 0 	 0 	 0.000 
 5 		D	 173 	 48 	 17023 	 ◊ 	A	 231 	 40 	 67556 	  1691.5 	  -18.9 	 0.920 	 24 	 0 	 0 	 0.000 


Now if we call the script to run it with the PDB code 6agb, it will first check if there's a file `6agb_interface_list.txt` in the working directory and use that if there is.

In [8]:
%run pisa_interface_list_to_df.py 6agb
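
That "use the local file if it is there, otherwise go get it" behavior can be sketched roughly as below. This is my own illustration of the pattern, not the script's actual code: `get_interface_text` and the `fetch` callable are hypothetical names, and the fake fetcher just stands in for a real retrieval step.

```python
import tempfile
from pathlib import Path


def get_interface_text(pdb_code, fetch, directory="."):
    """Return (text, source), preferring a local file over fetching.

    `fetch` is a hypothetical callable that takes a PDB code and returns text.
    """
    local_file = Path(directory) / f"{pdb_code}_interface_list.txt"
    if local_file.is_file():
        return local_file.read_text(encoding="utf-8"), "local"
    return fetch(pdb_code), "fetched"


def fake_fetch(pdb_code):
    # Stand-in for a real retrieval from PDBePISA.
    return f"remote data for {pdb_code}"


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "6agb_interface_list.txt").write_text("pasted table", encoding="utf-8")
    local_result = get_interface_text("6agb", fake_fetch, directory=tmp)
    remote_result = get_interface_text("4fgf", fake_fetch, directory=tmp)

print(local_result[1], remote_result[1])
```

Passing the fetcher in as an argument keeps the example runnable offline and makes the fallback easy to swap out.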

In [9]:
import pandas as pd
dfl = pd.read_pickle("6agb_PISAinterface_summary_pickled_df.pkl")

In [10]:
dfl

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,B,726,187,45544,◊,A,880,117,67556,7353.3,-72.7,0.996,95,0,0,0.0
1,2,G,238,65,9059,◊,A,266,32,67556,2325.0,-24.6,0.739,23,0,0,0.0
2,3,D,244,65,17023,◊,J,222,56,16869,2290.1,-20.0,0.174,18,3,0,0.0
3,4,C,207,52,12572,◊,K,222,58,10410,2114.0,-25.4,0.157,15,0,0,0.0
4,5,D,173,48,17023,◊,A,231,40,67556,1691.5,-18.9,0.92,24,0,0,0.0
5,6,D,154,39,17023,◊,K,151,34,10410,1475.6,-13.5,0.211,13,2,0,0.0
6,7,E,123,34,9604,◊,I,129,30,13215,1257.1,-17.9,0.058,7,0,0,0.0
7,8,B,130,39,45544,◊,G,126,33,9059,1234.7,-10.2,0.217,10,8,0,0.0
8,9,E,106,30,9604,◊,A,138,22,67556,1186.0,-9.8,0.916,23,0,0,0.0
9,10,F,119,32,9964,◊,A,134,23,67556,1120.8,-10.9,0.755,14,0,0,0.0


------

### Using the produced dataframe

I'll do some cursory steps with the dataframes to give you a flavor of what can be done. More will be done in other notebooks in this series.

*to be done*
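
In the meantime, here's one small sketch of the sort of thing that's possible. It rebuilds a few rows of the 4fgf table by hand (only four of the columns, with the two-level column labels as displayed above) and then keeps just the chain-to-chain interfaces by dropping rows where either partner is a ligand label such as `[BME]A:149`. Treating a leading `[` as the sign of a ligand is my assumption based on the labels seen in these examples.

```python
import pandas as pd

# Two-level column labels, as they appear in the displayed dataframes.
columns = pd.MultiIndex.from_tuples([
    ("Chain 1", "Chain label"),
    ("Chain 2", "Chain label"),
    ("Interface", "Area (Å$^2$)"),
    ("Interface", "CSS"),
])
# A handful of rows copied from the 4fgf output above.
demo = pd.DataFrame(
    [
        ["A", "A", 340.7, 0.0],
        ["[BME]A:149", "A", 107.4, 0.073],
        ["[SO4]A:147", "A", 101.4, 0.783],
        ["A", "[BME]A:148", 63.1, 0.0],
        ["A", "A", 53.9, 0.0],
    ],
    columns=columns,
)

# Select a column with a (top level, second level) tuple.
is_ligand_1 = demo[("Chain 1", "Chain label")].str.startswith("[")
is_ligand_2 = demo[("Chain 2", "Chain label")].str.startswith("[")

# Keep rows where neither partner looks like a ligand.
chains_only = demo[~(is_ligand_1 | is_ligand_2)]

# Sort the remaining interfaces by area, largest first.
largest_first = chains_only.sort_values(("Interface", "Area (Å$^2$)"), ascending=False)
print(largest_first)
```

With a dataframe read from one of the pickle files, the same tuple-style column selection should apply as long as the column labels match those shown.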

--------

The script demonstrated above will get the interface reports (the lists of interactions) from PDBePISA, if you don't already have them, and then generate a Pandas dataframe with that information.  


However, you may just want a smaller piece of information, or information from elsewhere on the interactions page, and be wondering how you can get that.  
Or just wondering more about what is going on in the retrieval step of the script demonstrated above?  
Read on...

### Retrieving interface reports/ the list of interactions by hand

Say you are used to pressing `Launch PDBePISA` at [PDBePISA](https://www.ebi.ac.uk/pdbe/pisa/pistart.html) and then entering the PDB identifier code for the Protein Data Bank entry in which you are interested. That's great when you are interested in one or two entries. But when you want to scale up, you need to be able to do that programmatically.

PDBePISA offers three utilities helpful for accessing PDBePISA programmatically.

There's the [PDBe REST API - Programmatic access to PDBe data](https://www.ebi.ac.uk/pdbe/api/doc/pisa.html) for certain queries such as getting returned the "number of interfaces for a given pdbid/assemblyid."

There are also URLs that you can use to get XML of interfaces and descriptions of assemblies, and even "Coordinate (PDB-formatted) files of macromolecular assemblies:", as described [here](https://www.ebi.ac.uk/pdbe/pisa/pi_download.html) under 'Download PISA Data'.

Finally, **yet another page** of the PDBePISA documentation about getting data from the service, entitled [Linking to PISA](https://www.ebi.ac.uk/pdbe/pisa/pi_link.html), describes how "PISA may run queries launched from any Web site. Simply make hyperlink with the following URL." It goes on to show the link and how you can add a signaling token to specify the corresponding table to get, along with the PDB identifier:

```text
 Token 	 Description 
qi	PISA retrieves precalculated results for the given PDB code and displays the corresponding interface table
qs	PISA retrieves precalculated results for the given PDB code and displays the corresponding structure table
qa	PISA retrieves precalculated results for the given PDB code and displays the corresponding assembly table
```

Thus `http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/piserver?qi=1stm` will get the interface table for PDB entry 1stm.

**Specifically**, the `qi=1stm` at the end is the part coming from the tokens and PDB identifier.
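
If you want to build such URLs for several entries, a small helper along these lines could work. The function name `pisa_url` is mine, and the base URL and three tokens are simply those quoted above; I'm assuming nothing about the service beyond that.

```python
from urllib.parse import urlencode

# Base URL and tokens as given on the 'Linking to PISA' page quoted above.
PISA_BASE = "http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/piserver"
TOKENS = {
    "qi": "interface table",
    "qs": "structure table",
    "qa": "assembly table",
}


def pisa_url(pdb_code, token="qi"):
    """Build a PISA query URL from a token and a PDB identifier."""
    if token not in TOKENS:
        raise ValueError(f"unknown token {token!r}; expected one of {sorted(TOKENS)}")
    return f"{PISA_BASE}?{urlencode({token: pdb_code})}"


urls = [pisa_url(code) for code in ("1stm", "6agb")]
print(urls[0])
```

Swapping `token="qa"` in would point at the assembly table instead of the interface table.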

This approach can be used to retrieve the HTML for the same page you'd get if you went to `Launch PDBePISA` at [PDBePISA](https://www.ebi.ac.uk/pdbe/pisa/pistart.html) and then entered the PDB identifier code for 1stm.


Putting that into action in Jupyter (with a little command-line magic) to fetch the interactions list for the example and save it as a text file:

In [11]:
!curl -o 1stm.txt -L http://www.ebi.ac.uk/pdbe/pisa/cgi-bin/piserver?qi=1stm

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 75207    0 75207    0     0   7998      0 --:--:--  0:00:09 --:--:-- 16208


That got the HTML and saved it as `1stm.txt`.

Let's look at the top part of that. We'll show the first 20 lines of what was retrieved by running the next cell.

In [12]:
!head -20 1stm.txt


<!doctype html>
<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en"> <![endif]-->
<!-- Consider adding an manifest.appcache: h5bp.com/d/Offline -->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
    <head>
        <meta charset="utf-8">

        <!-- Use the .htaccess and remove these lines to avoid edge case issues.
             More info: h5bp.com/b/378 -->
        <!-- <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> --> <!-- Not yet implemented -->

        <title>PDBe &lt; PISA &lt; EMBL-EBI</title>
        <meta name="description" content="EMBL-EBI"><!-- Describe what this page is about -->
        <meta name="keywords" content="bioinformatics, europe, institute"><!-- A few keywords that relate to the content of TH

Using Python you can read and parse this HTML. That's what the script does when you give it a PDB code for data you want.  

In the end it takes the parsed table and formats it into a Pandas dataframe.
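
As a rough idea of what parsing a table out of HTML can look like, here's a sketch using only Python's standard library. The HTML snippet is a tiny one I made up (the real PISA page is much more involved), and the script itself may well use a different parser, so treat this as an illustration of the general idea only.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up miniature table, loosely echoing the 6agb values shown earlier.
sample_html = """
<table>
  <tr><th>NN</th><th>Structure 1</th><th>Structure 2</th><th>Area</th></tr>
  <tr><td>1</td><td>B</td><td>A</td><td>7353.3</td></tr>
  <tr><td>2</td><td>G</td><td>A</td><td>2325.0</td></tr>
</table>
"""

parser = TableCellCollector()
parser.feed(sample_html)
table_rows = parser.rows
print(table_rows)
```

From a list of rows like that, `pd.DataFrame(table_rows[1:], columns=table_rows[0])` would be one way to get to a dataframe.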

--------

In the next two notebooks, I cover enhancing the dataframes with the names of the macromolecules in place of the letter designations of the chains, so that the interface table dataframe is more informative, and then I cover scaling up to make dataframes for a lot of PDB identifiers. Go to the index page and click through to those.

------