# Further guide to using the dataframes produced by `pisa_interface_list_to_df.py`

Because the MultiIndex columns of the dataframes produced by `pisa_interface_list_to_df.py` are a bit more complex than those of a typical dataframe, I thought a little more assistance in dealing with them may prove useful. This notebook further touches on the special handling steps that may be needed.

------

### Preparation

This notebook is designed so that it can be run separately from the others in the series, and so some preparation is required at the outset. This preparation doesn't cause any issues if it was already run, so just go ahead and rerun it if you need to start the notebook over.

All these preparation steps were covered in the first notebook in the series, so work through that notebook if anything is not familiar.

In [1]:
# Get script file if not yet retrieved / check if file exists
import os
file_needed = "pisa_interface_list_to_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/structurework/master/pdbepisa-utilities/{file_needed}

In [2]:
import pandas as pd

In [3]:
# import the main function of the script into the notebook's namespace
# so that it can be used here
from pisa_interface_list_to_df import pisa_interface_list_to_df

This last step will run the script and produce a dataframe of the interface data and read it into this notebook's namespace.

In [4]:
df = pisa_interface_list_to_df('6ahu')


------

## Dealing with the multiindex complexity by bypassing it.

Note that the MultiIndex columns here make things a little more difficult than with typical, single-level dataframes. If you want to see how much easier working with dataframes usually is, see the first two notebooks that come up when you launch a session from my [blast-binder](https://github.com/fomightez/blast-binder) site. Those first two notebooks cover using a dataframe containing BLAST results. For an easier groupby example, see the second notebook in sessions launched from my [pdbsum-binder](https://github.com/fomightez/pdbsum-binder) site.

However, for the dataframe produced by the script `pisa_interface_list_to_df.py`, we get MultiIndex columns, also called hierarchical indexing. If we don't care how well the column labels parallel PDBePISA's Interfaces webpage, we have options. Let's show collapsing the MultiIndex / hierarchical columns for easier use of typical Pandas methods and functions. Hopefully, you'll find the results much easier to work with and curse me for not showing this sooner.

We'll show the first few rows of the dataframe being used here as an example once more so the changes upon collapsing will be obvious.

In [5]:
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6_level_1,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,B,578,143,46081,◊,A,717,101,60764,6096.4,-80.6,0.981,93,0,0,0.0
1,2,H,213,53,8805,◊,L,269,68,19954,2298.1,-28.0,0.096,20,2,0,0.0
2,3,C,176,44,13848,◊,K,193,51,9529,1757.2,-15.6,0.199,14,3,0,0.0
3,4,G,190,52,9528,◊,A,204,21,60764,1683.4,-2.0,0.936,12,0,0,0.0
4,5,K,157,40,9529,◊,A,188,35,60764,1643.0,-27.2,0.692,21,0,0,0.0


One way to collapse is to just tell Pandas to take only one level of the label names. We'll tell Pandas to use the lower one of the two.

In [6]:
df_collapsed1 = df.copy()
df_collapsed1.columns = df_collapsed1.columns.get_level_values(1)

The copying here was done to keep the original intact. However, you probably would also want to **keep the original intact for viewing** in a notebook because the collapsed versions don't look as nice displaying in Jupyter. 

In [7]:
df_collapsed1.head()

Unnamed: 0,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 6,Chain label.1,Number_InterfacingAtoms.1,Number_InterfacingResidues.1,Surface (Å$^2$).1,Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1,B,578,143,46081,◊,A,717,101,60764,6096.4,-80.6,0.981,93,0,0,0.0
1,2,H,213,53,8805,◊,L,269,68,19954,2298.1,-28.0,0.096,20,2,0,0.0
2,3,C,176,44,13848,◊,K,193,51,9529,1757.2,-15.6,0.199,14,3,0,0.0
3,4,G,190,52,9528,◊,A,204,21,60764,1683.4,-2.0,0.936,12,0,0,0.0
4,5,K,157,40,9529,◊,A,188,35,60764,1643.0,-27.2,0.692,21,0,0,0.0


Notice that to get the lower one, the second one listed, `.get_level_values(1)` was used in the assignment of the new columns. It may seem odd to use `(1)` to specify the **second** level; however, this comes from Python using zero-based indexing, which means index `0` refers to the **first** element of any sequence.
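
To make the level numbering concrete, here is a minimal sketch on a toy dataframe. The toy values and the name `toy` are made up for illustration; only the column labels mimic the ones produced by the script.

```python
import pandas as pd

# Toy two-level columns that mimic the layout above (values are made up).
toy = pd.DataFrame(
    [["B", 578, "A", 717], ["H", 213, "L", 269]],
    columns=pd.MultiIndex.from_tuples(
        [
            ("Chain 1", "Chain label"),
            ("Chain 1", "Number_InterfacingAtoms"),
            ("Chain 2", "Chain label"),
            ("Chain 2", "Number_InterfacingAtoms"),
        ]
    ),
)

upper = toy.columns.get_level_values(0)  # first (top) level of labels
lower = toy.columns.get_level_values(1)  # second (bottom) level of labels

print(list(upper))
print(list(lower))
```

Printing both lists side by side shows that level `0` holds the group names and level `1` holds the per-column names, which repeat across groups.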

However, that produced two columns named 'Chain label', two named 'Number_InterfacingAtoms', and so on. To show that, look at:

In [8]:
df_collapsed1.columns 

Index(['row #', 'Chain label', 'Number_InterfacingAtoms',
       'Number_InterfacingResidues', 'Surface (Å$^2$)', ' ', 'Chain label',
       'Number_InterfacingAtoms', 'Number_InterfacingResidues',
       'Surface (Å$^2$)', 'Area (Å$^2$)', 'Solvation free energy gain',
       'Solvation gain P-value', 'Hydrogen bonds', 'Salt Bridges',
       'Disuflides', 'CSS'],
      dtype='object')

That's not going to make it particularly easy to call individual columns. For example, specifying the 'Chain label' column gives both columns with that name:

In [9]:
df_collapsed1['Chain label'].head()

Unnamed: 0,Chain label,Chain label.1
0,B,A
1,H,L
2,C,K
3,G,A
4,K,A


Another way to show they aren't unique is to use Python's set math to reduce the columns to the unique ones and see if that number matches the total number of columns, like so:

In [10]:
len(df_collapsed1.columns) == len(set(df_collapsed1.columns))

False

The equality check in the cell above gives `False` because `len(df_collapsed1.columns)` is `17` and `len(set(df_collapsed1.columns))` is `13`.
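
As an alternative to the set comparison, Pandas index objects can report duplicates directly. The following is a small sketch on a made-up list of labels, not on the real dataframe; the name `cols` is just for illustration.

```python
import pandas as pd

# A shortened, made-up set of column labels with repeats,
# similar to what taking only the lower level produces.
cols = pd.Index(
    ["row #", "Chain label", "Surface", "Chain label", "Surface", "CSS"]
)

all_unique = cols.is_unique  # False when any label repeats
# duplicated() flags the second and later occurrences of each label
repeated = cols[cols.duplicated()].unique().tolist()

print(all_unique)
print(repeated)
```

This has the advantage of naming which labels collide, rather than only saying that some do.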

We can actually collapse the multi-level columns without losing that information. The levels were designed to make the columns unique together, and so we can keep both by using Python's string `join` method in conjunction with the `.map` method of the Pandas column index.

In [11]:
df_collapsed2 = df.copy()
df_collapsed2.columns = df_collapsed2.columns.map(' '.join)

In [12]:
df_collapsed2.head()

Unnamed: 0,row #,Chain 1 Chain label,Chain 1 Number_InterfacingAtoms,Chain 1 Number_InterfacingResidues,Chain 1 Surface (Å$^2$),x,Chain 2 Chain label,Chain 2 Number_InterfacingAtoms,Chain 2 Number_InterfacingResidues,Chain 2 Surface (Å$^2$),Interface Area (Å$^2$),Interface Solvation free energy gain,Interface Solvation gain P-value,Interface Hydrogen bonds,Interface Salt Bridges,Interface Disuflides,Interface CSS
0,1,B,578,143,46081,◊,A,717,101,60764,6096.4,-80.6,0.981,93,0,0,0.0
1,2,H,213,53,8805,◊,L,269,68,19954,2298.1,-28.0,0.096,20,2,0,0.0
2,3,C,176,44,13848,◊,K,193,51,9529,1757.2,-15.6,0.199,14,3,0,0.0
3,4,G,190,52,9528,◊,A,204,21,60764,1683.4,-2.0,0.936,12,0,0,0.0
4,5,K,157,40,9529,◊,A,188,35,60764,1643.0,-27.2,0.692,21,0,0,0.0


In [13]:
df_collapsed2.columns

Index(['  row #', 'Chain 1 Chain label', 'Chain 1 Number_InterfacingAtoms',
       'Chain 1 Number_InterfacingResidues', 'Chain 1 Surface (Å$^2$)', 'x  ',
       'Chain 2 Chain label', 'Chain 2 Number_InterfacingAtoms',
       'Chain 2 Number_InterfacingResidues', 'Chain 2 Surface (Å$^2$)',
       'Interface Area (Å$^2$)', 'Interface Solvation free energy gain',
       'Interface Solvation gain P-value', 'Interface Hydrogen bonds',
       'Interface Salt Bridges', 'Interface Disuflides', 'Interface CSS'],
      dtype='object')

Now we'll check computationally that these are all unique, as we tried for the first attempt above, by seeing if the total number of columns is the same as the number of unique columns:

In [14]:
len(df_collapsed2.columns) == len(set(df_collapsed2.columns))

True
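
You may have noticed in the column listing that a few joined labels carry extra spaces, such as `'  row #'` and `'x  '`. That appears to come from levels that hold only a single space. If that bothers you, one possible cleanup is to strip the whitespace after joining. This is a sketch on toy labels, and it assumes the blank levels are single spaces as the output above suggests.

```python
import pandas as pd

# Toy MultiIndex where a blank level is a single space (an assumption
# based on the joined labels shown in the listing above).
toy_cols = pd.MultiIndex.from_tuples(
    [
        (" ", "row #"),
        ("Chain 1", "Chain label"),
        ("x", " "),
        ("Chain 2", "Chain label"),
    ]
)

joined = toy_cols.map(" ".join)  # same approach as in the cell above
cleaned = joined.str.strip()     # drop leading and trailing spaces

print(joined.tolist())
print(cleaned.tolist())
```

With the padding gone, `'row #'` and `'x'` can be typed as they read, and the labels are still unique.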

If you are happy with the collapsed form, you can use it to do further downstream analysis. Applying methods and functions that involve the columns will have syntax that is easier to write. And indeed some typical Pandas features may only be available using the collapsed form.

For example, renaming is easier to write:

In [15]:
df_collapsed2 = df_collapsed2.rename(columns={'Chain 1 Chain label':'Chain1'}) 

And then if you have a column name that is a single word, you can use Pandas attribute notation to make it easier to specify a column.  
For example, `df_collapsed2.Chain1` is easier to write than `df_collapsed2['Chain1']`.

In [16]:
df_collapsed2.Chain1

0     B
1     H
2     C
3     G
4     K
5     B
6     D
7     F
8     D
9     E
10    J
11    C
12    H
13    B
14    D
15    L
16    F
17    E
18    A
19    E
20    E
21    D
22    B
23    H
24    I
25    J
26    D
27    I
28    E
29    H
30    C
31    E
32    K
33    J
34    K
35    G
36    B
Name: Chain1, dtype: object

You can imagine as you move towards doing more complex Pandas processing, minimizing the complexity of writing the commands can start contributing substantially to ease of use.
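
As a taste of that, here is a hedged sketch of two common steps, filtering with `.query()` and summarizing with `.groupby()`, on a toy single-level dataframe. The column names `Chain1`, `Chain2`, and `Interface_Hydrogen_bonds` and all the values are hypothetical; with the real collapsed dataframe you would first rename columns to single words as shown above.

```python
import pandas as pd

# Toy single-level dataframe; names and values are made up.
toy = pd.DataFrame(
    {
        "Chain1": ["B", "H", "B", "C"],
        "Chain2": ["A", "L", "D", "K"],
        "Interface_Hydrogen_bonds": [93, 20, 7, 14],
    }
)

# Rows where the first chain is B, written as a readable expression
b_rows = toy.query("Chain1 == 'B'")

# Total hydrogen bonds per first chain
hbonds_per_chain = toy.groupby("Chain1")["Interface_Hydrogen_bonds"].sum()

print(b_rows)
print(hbonds_per_chain)
```

With MultiIndex columns, the same steps need tuple keys such as `('Chain 1','Chain label')` everywhere, which gets tedious quickly.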

## Adding more datatypes to the dataframe

Not all the columns in the dataframes produced by `pisa_interface_list_to_df.py` have defined datatypes. This keeps the script simpler while allowing the greatest flexibility in downstream analysis. An important reason for keeping it simpler is that not all the dataframes from all structure data look like the one above; you can see more of the range by running [this notebook](tests_of_pisa_interface_list_to_df.py.ipynb). For example, the dataframe used here has the following datatypes:

In [17]:
df.dtypes

           row #                           int64
Chain 1    Chain label                    object
           Number_InterfacingAtoms         int64
           Number_InterfacingResidues      int64
           Surface (Å$^2$)                 int64
x                                         object
Chain 2    Chain label                    object
           Number_InterfacingAtoms         int64
           Number_InterfacingResidues      int64
           Surface (Å$^2$)                 int64
Interface  Area (Å$^2$)                  float64
           Solvation free energy gain    float64
           Solvation gain P-value        float64
           Hydrogen bonds                  int64
           Salt Bridges                    int64
           Disuflides                      int64
           CSS                           float64
dtype: object

Notice that while a lot of them are decimal values, indicated by `float64`, or integers, indicated by `int64`, several have the `object` datatype (`dtype`), which means they come from columns holding text or mixed types of data.

What you see when you run `df.dtypes` may differ markedly if the content in your columns and rows varies from the example here. In particular, dataframes with 'average' rows look much different.   
Let's see an example of that by running the next two cells:

In [18]:
dfav = pisa_interface_list_to_df('1trn')

In [19]:
dfav.dtypes

           Id                             object
           row #                          object
Chain 1    Chain label                    object
           Number_InterfacingAtoms        object
           Number_InterfacingResidues     object
           Surface (Å$^2$)                object
x                                         object
Chain 2    Chain label                    object
           SymOp                          object
           SymID                          object
           Number_InterfacingAtoms        object
           Number_InterfacingResidues     object
           Surface (Å$^2$)                object
Interface  Area (Å$^2$)                  float64
           Solvation free energy gain    float64
           Solvation gain P-value        float64
           Hydrogen bonds                  int64
           Salt Bridges                    int64
           Disuflides                      int64
           CSS                           float64
dtype: object

Note that a lot of these are `object` dtype because the 'average' rows cause a lot of blank cells in the left part of the table. That can be seen by looking at the first few rows:

In [20]:
dfav.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,Id,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 7_level_1,Chain label,SymOp,SymID,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1.0,1.0,A,68.0,19.0,9795.0,x,A,"-y,x,z",3_555,81.0,20.0,9795,726.0,-3.3,0.416,8,1,0,0.0
1,,2.0,B,79.0,19.0,9608.0,x,B,"-y+1,x,z",3_655,65.0,20.0,9608,698.3,-2.2,0.553,10,2,0,0.0
2,,,,,,,,,,,,,**_Average:_**,712.1,-2.7,0.484,9,2,0,0.0
3,2.0,3.0,B,26.0,11.0,9608.0,◊,A,"x,y,z",1_555,29.0,13.0,9795,238.1,-1.8,0.399,2,0,0,0.0
4,3.0,4.0,A,27.0,12.0,9795.0,◊,B,"x,y,z-1",1_554,25.0,8.0,9608,216.4,-3.3,0.13,3,0,0,0.0


See how the row with index `2` is blank most of the way across the table, until the `Chain 2  Surface (Å$^2$)` column. And see how the `**_Average:_**` label in that row 'disrupts' the values in that column. That label causes the column to be mixed, and so it is `object` in this case and not `int64` as that specific column is for `df`.
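
The effect of a single text label on a column's dtype can be reproduced in isolation. This is a minimal sketch with made-up Series named `clean` and `mixed`; the values only echo the surface areas shown above.

```python
import pandas as pd

# All integers: Pandas picks an integer dtype
clean = pd.Series([9795, 9608, 9795])

# One text label among the integers: Pandas falls back to object
mixed = pd.Series([9795, 9608, "**_Average:_**", 9795])

print(clean.dtype)
print(mixed.dtype)

# errors="coerce" replaces anything that is not a number with NaN
coerced = pd.to_numeric(mixed, errors="coerce")
print(coerced.dtype)
```

Note that coercing throws away the label, so whether that is acceptable depends on what you plan to do with the column.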

That should give you some insight into what is going on. Let's explore how we'd work around that using a practical example.

If you've programmed some, you have probably found that computers often need defined types of data to be able to handle it correctly. And so you may need to define the datatypes (`dtypes`) of those `object` columns in order to do further processing computationally.    
Let's demonstrate that.

In the previous notebook, we limited the dataframe to rows that met a condition, sent a column to a list, and then used set math to limit it to the unique values to show what residues of a chain were involved in forming interfaces. We'll attempt to do this for chain B here with the following code:

In [21]:
when_B_chain1 = dfav[dfav[('Chain 1','Chain label')] == 'B']
when_B_chain1_res = when_B_chain1[('Chain 1','Number_InterfacingResidues')].tolist()
when_B_chain2 = dfav[dfav[('Chain 2','Chain label')] == 'B']
when_B_chain2_res = when_B_chain2[('Chain 2','Number_InterfacingResidues')].tolist()
chain_B_res_interacting_set = set(when_B_chain1_res + when_B_chain2_res)
chain_B_res_interacting_set

{'11', '13', '19', '20', '3', '8', '9'}

You might notice there is an issue:

- The numbers are flanked by quotes and **that means they are strings and not integers**.

That means you cannot simply iterate on the elements of this set and tell if the positions are before or after position 10 using code like this:

```python
arbitrary_pos = 10
for residue in chain_B_res_interacting_set:
    if residue > arbitrary_pos:
        print(f"Residue {residue} is after {arbitrary_pos}.")
    else:
        print(f"Residue {residue} is before {arbitrary_pos}.")
```

You can make a new cell and run that code.  
If you do, you'll get:

```python
TypeError: '>' not supported between instances of 'str' and 'int'
```

This is because the elements of `chain_B_res_interacting_set` are strings and not integers, since they came from a column that has the `object` dtype.

You'd have similar problems if you tried subsequent analysis with other data in the `object` columns.
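
Not every such problem announces itself with an error. As a small sketch of a quieter one, sorting or taking the maximum of number-like strings works without complaint but compares them character by character. The set below copies the values shown earlier; the name `as_strings` is just for illustration.

```python
# The same values as in the set above, still as strings
as_strings = {"11", "13", "19", "20", "3", "8", "9"}

# Strings sort alphabetically, so '3' lands after '20'
print(sorted(as_strings))

# Converting each element for the comparison gives numeric order
print(sorted(as_strings, key=int))

# max() picks '9' because '9' is the largest first character
print(max(as_strings))
```

Silent wrong answers like these are a good reason to fix the types rather than work around them.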

In the simple example, we could cast the residue to an integer as we iterate on it.

```python
arbitrary_pos = 10
for residue in chain_B_res_interacting_set:
    if int(residue) > arbitrary_pos:
        print(f"Residue {residue} is after {arbitrary_pos}.")
    else:
        print(f"Residue {residue} is before {arbitrary_pos}.")
```
And that would be perfectly valid if that single analysis was all you planned.  
However, the data in `dfav` & `chain_B_res_interacting_set` still aren't easily usable because you only fixed some derived data on the fly. If you were going to do a lot of different analyses, it would be better to have the data in the correct form at the source to make all the downstream steps easier and consistent.  
So let's change that in the dataframe.  
First we'll do the `('Chain 1','Number_InterfacingResidues')` column. Because it is a good habit to test things on a copy of your dataframe first, we'll make a copy to start and work with that. (In fact, you may not like the view of the result as much as the dataframe that comes directly from `pisa_interface_list_to_df.py`, and so even though you make a new dataframe to do subsequent analyses with, you may wish to keep the original for most viewing steps.)

In [22]:
dfav_fixed = dfav.copy()
dfav_fixed[('Chain 1','Number_InterfacingResidues')] = pd.to_numeric(dfav_fixed[('Chain 1','Number_InterfacingResidues')], downcast="integer")
dfav_fixed.dtypes

           Id                             object
           row #                          object
Chain 1    Chain label                    object
           Number_InterfacingAtoms        object
           Number_InterfacingResidues    float64
           Surface (Å$^2$)                object
x                                         object
Chain 2    Chain label                    object
           SymOp                          object
           SymID                          object
           Number_InterfacingAtoms        object
           Number_InterfacingResidues     object
           Surface (Å$^2$)                object
Interface  Area (Å$^2$)                  float64
           Solvation free energy gain    float64
           Solvation gain P-value        float64
           Hydrogen bonds                  int64
           Salt Bridges                    int64
           Disuflides                      int64
           CSS                           float64
dtype: object

It's now `float64`, even though `downcast="integer"` was requested. This is because the blank cells become `NaN`, and NumPy's standard integer dtypes cannot hold `NaN` while floats can. Luckily, math will still work. And at least the numbers are numbers now and not strings.  
Let's change the other column.

In [23]:
dfav_fixed[('Chain 2','Number_InterfacingResidues')] = pd.to_numeric(dfav_fixed[('Chain 2','Number_InterfacingResidues')], downcast="integer")
dfav_fixed.dtypes

           Id                             object
           row #                          object
Chain 1    Chain label                    object
           Number_InterfacingAtoms        object
           Number_InterfacingResidues    float64
           Surface (Å$^2$)                object
x                                         object
Chain 2    Chain label                    object
           SymOp                          object
           SymID                          object
           Number_InterfacingAtoms        object
           Number_InterfacingResidues    float64
           Surface (Å$^2$)                object
Interface  Area (Å$^2$)                  float64
           Solvation free energy gain    float64
           Solvation gain P-value        float64
           Hydrogen bonds                  int64
           Salt Bridges                    int64
           Disuflides                      int64
           CSS                           float64
dtype: object

So what does the dataframe look like in those changed columns? We can see by just displaying the first few rows:

In [24]:
dfav_fixed.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chain 1,Chain 1,Chain 1,Chain 1,x,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Chain 2,Interface,Interface,Interface,Interface,Interface,Interface,Interface
Unnamed: 0_level_1,Id,row #,Chain label,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Unnamed: 7_level_1,Chain label,SymOp,SymID,Number_InterfacingAtoms,Number_InterfacingResidues,Surface (Å$^2$),Area (Å$^2$),Solvation free energy gain,Solvation gain P-value,Hydrogen bonds,Salt Bridges,Disuflides,CSS
0,1.0,1.0,A,68.0,19.0,9795.0,x,A,"-y,x,z",3_555,81.0,20.0,9795,726.0,-3.3,0.416,8,1,0,0.0
1,,2.0,B,79.0,19.0,9608.0,x,B,"-y+1,x,z",3_655,65.0,20.0,9608,698.3,-2.2,0.553,10,2,0,0.0
2,,,,,,,,,,,,,**_Average:_**,712.1,-2.7,0.484,9,2,0,0.0
3,2.0,3.0,B,26.0,11.0,9608.0,◊,A,"x,y,z",1_555,29.0,13.0,9795,238.1,-1.8,0.399,2,0,0,0.0
4,3.0,4.0,A,27.0,12.0,9795.0,◊,B,"x,y,z-1",1_554,25.0,8.0,9608,216.4,-3.3,0.13,3,0,0,0.0


Notice the `NaN` entries in the columns we altered. `NaN` stands for 'Not a Number' and is a special data value that Pandas and Python's Numerical Python module, `numpy`, use. Behind the scenes it is a floating point value, and so now the data types in the column are consistent and **no longer 'mixed'**.

Now that we know it works, we'll assign the 'fixed' dataframe to `dfav` and then use the code worked out above to collect the set of interacting residues and see what happens if we check if the residues are above or below position 10:

In [25]:
dfav = dfav_fixed
# next use code from above to get the interacting residues for chain B
when_B_chain1 = dfav[dfav[('Chain 1','Chain label')] == 'B']
when_B_chain1_res = when_B_chain1[('Chain 1','Number_InterfacingResidues')].tolist()
when_B_chain2 = dfav[dfav[('Chain 2','Chain label')] == 'B']
when_B_chain2_res = when_B_chain2[('Chain 2','Number_InterfacingResidues')].tolist()
chain_B_res_interacting_set = set(when_B_chain1_res + when_B_chain2_res)
chain_B_res_interacting_set
# next step through each element of the set checking above and below
arbitrary_pos = 10
for residue in chain_B_res_interacting_set:
    if residue > arbitrary_pos:
        print(f"Residue {residue} is after {arbitrary_pos}.")
    else:
        print(f"Residue {residue} is before {arbitrary_pos}.")

Residue 3.0 is before 10.
Residue 8.0 is before 10.
Residue 9.0 is before 10.
Residue 11.0 is after 10.
Residue 13.0 is after 10.
Residue 19.0 is after 10.
Residue 20.0 is after 10.


In this simplistic example case, the benefit of changing the dtype in the dataframe isn't overly clear, especially since the output of the `print` statements now looks odd with the trailing `.0`. However, you may want to do more complex Pandas steps, and Pandas is good at handling `NaN` values, so your life will be easier. And with numeric dtypes in the source dataframes, things become more reproducible: you can use the same steps to process data from the corresponding columns in `df` and `dfav` now because both hold numbers.
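
If the trailing `.0` bothers you, or you have several columns to convert, one option worth knowing about is Pandas' nullable integer dtype, `Int64` (capital `I`), which can hold whole numbers alongside missing values. The sketch below is not what the cells above do; it uses a toy dataframe with hypothetical column names. With the real dataframe, the entries in `numeric_cols` would be tuples such as `('Chain 1','Number_InterfacingResidues')`.

```python
import pandas as pd

# Toy stand-in with number-like strings and a blank 'average' row.
toy = pd.DataFrame(
    {
        "label": ["A", "B", None, "B"],
        "atoms": ["68", "79", None, "26"],
        "residues": ["19", "19", None, "11"],
    }
)

numeric_cols = ["atoms", "residues"]  # hypothetical column names
fixed = toy.copy()  # work on a copy, keeping the original intact
for col in numeric_cols:
    # errors="coerce" turns anything unparseable into NaN instead of
    # raising; astype("Int64") then stores integers plus missing values
    fixed[col] = pd.to_numeric(fixed[col], errors="coerce").astype("Int64")

print(fixed.dtypes)
print(fixed["residues"].dropna().tolist())
```

One caution: `errors="coerce"` silently discards anything that is not a number, so it is worth checking a column's contents first if you are not sure what is in it.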

Recall that an earlier notebook covered storing the dataframe in a compressed binary 'pickled' form. You may wish to do that with a modified version so that you can easily pick up later with the data in the improved form.
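
As a quick reminder of the general idea, here is a minimal sketch of a pickle round trip that shows the dtype surviving. The toy dataframe and the file name `dfav_fixed.pkl` are made up for illustration, and a temporary directory is used so nothing is left behind; the earlier notebook remains the reference for the exact steps used in this series.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with a float column holding a missing value
toy = pd.DataFrame({"residues": [19.0, float("nan"), 11.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "dfav_fixed.pkl")  # hypothetical name
    toy.to_pickle(pkl_path)              # write the binary form
    restored = pd.read_pickle(pkl_path)  # read it back

# The dtypes and values, including NaN, come back unchanged
print(restored.dtypes)
print(restored.equals(toy))
```

Unlike a round trip through CSV, pickling keeps the dtypes you set, so the conversion work doesn't need repeating.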

--------

Continue on with the next notebook in the series????. In the next notebook, I cover ...  
Go to the index page and click through to other notebooks after the next in the series if you prefer.

------
