# ProBis data preparation

In [1]:
from pathlib import Path

import pandas as pd

from probis_utils import parse_ligand_tables

## ProBis tables

ProBis ships its results in two `csv` files:

- `simProtTable_XXXXA.csv` (*ProBis protein table*) and
- `predlig_XXXXA.csv` (*ProBis ligand table*),

where `XXXX` is the PDB ID and `A` the PDB chain.
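
As a small illustration of this naming scheme, both file names can be derived from a PDB ID and a chain. This is only a sketch: the helper name `probis_paths` is hypothetical and not part of `probis_utils`, and the folder name is just the one used in this notebook.

```python
from pathlib import Path


def probis_paths(folder, pdb_id, chain):
    """Return (protein table path, ligand table path) for one PDB chain.

    Hypothetical helper that only encodes the naming scheme described above.
    """
    folder = Path(folder)
    tag = f"{pdb_id}{chain}"  # e.g. "6lu7" + "A" -> "6lu7A"
    return folder / f"simProtTable_{tag}.csv", folder / f"predlig_{tag}.csv"


protein_path, ligand_path = probis_paths("probis_pocket_15_0.5", "6lu7", "A")
print(protein_path.name)  # simProtTable_6lu7A.csv
print(ligand_path.name)   # predlig_6lu7A.csv
```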

In [2]:
PROBIS_FOLDER = Path('.') / '..' / 'data' / 'probis' / 'probis_pocket_15_0.5'

PROTEIN_PATH = PROBIS_FOLDER / 'simProtTable_6lu7A.csv'
LIGAND_PATH = PROBIS_FOLDER / 'predlig_6lu7A.csv'

## ProBis protein table

### Parse and save data

No parsing is needed; the file can be loaded directly.

### Load data

In [3]:
protein_table = pd.read_csv(PROTEIN_PATH)
protein_table.head()

Unnamed: 0,PDB ID,Chain ID,Protein Name,Pfam ID,SCOP ID,UniProt ID,Z-Score
0,4mds,A,3C-LIKE PROTEINASE,PF05409,,P0C6U8,3.97
1,4wme,A,MERS-COV 3CL PROTEASE,,,,3.83
2,2yna,A,3C-LIKE PROTEINASE,PF05409,,P0C6T4,3.76
3,3d23,A,3C-LIKE PROTEINASE,PF05409,,Q0ZJJ1,3.65
4,2zu2,A,3C-LIKE PROTEINASE,PF05409,,P0C6U2,3.49


### Save UniProt IDs

In [4]:
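# Keep only the entries that have a UniProt ID assigned
uniprot_ids = protein_table['UniProt ID'].dropna()
# Write one ID per line, without index or header, next to the protein table
uniprot_ids.to_csv(f'{PROTEIN_PATH.parent / PROTEIN_PATH.stem}_uniprot_ids.csv', index=False, header=False)
len(uniprot_ids)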

62

## ProBis ligand table

The `predlig_XXXXA.csv` file cannot be read directly; it first has to be parsed and saved as a "clean" csv file. So let's first take a look at the file structure.

The `predlig_XXXXA.csv` file contains predictions for four data types (the section label used in the file is given in backticks):
(i) small molecules (`small molecules`),
(ii) proteins (`proteins`),
(iii) nucleic acids (`nucleic`), and
(iv) ions (`ion`).
The file follows this content scheme:

```
Type	Molecule Name	Residue Name	Source	Confidence	Binder

small molecules
Binding Site 1
...
Binding Site 2
...

proteins
Binding Site 1
...
Binding Site 2
...
Binding Site 3
...
Binding Site 4
...

nucleic
Binding Site 1
...
Binding Site 2
...

ion
Binding Site 1
...
```

The first line of the file is a header that describes two things:

- The type of data available
  - `Type`: section and binding site labels, e.g. `small molecules` > `Binding Site 1`
- The content of the tables (this is what `...` refers to in the above scheme)
  - `Molecule Name`
  - `Residue Name`: PDB residue ID of the small molecule, protein residue, nucleic acid, or ion
  - `Source`: PDB ID of the structure the data comes from
  - `Confidence`
  - `Binder`
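
To make the scheme concrete, here is a minimal sketch of how such a sectioned file could be flattened into one row per predicted ligand. It is not the implementation of `parse_ligand_tables` from `probis_utils`; the function name `parse_ligand_text` is hypothetical, the sample text is made up, and the sketch assumes that every table row is tab-separated with an empty `Type` field followed by the five table fields.

```python
import io

# Section labels and table columns as described above
DATA_TYPES = {"small molecules", "proteins", "nucleic", "ion"}
TABLE_COLUMNS = ["Molecule Name", "Residue Name", "Source", "Confidence", "Binder"]


def parse_ligand_text(handle):
    """Flatten the sectioned layout into a list of row dictionaries.

    Assumption: table rows have six tab-separated fields, the first one empty.
    """
    rows = []
    data_type = site = None
    next(handle)  # skip the header line
    for line in handle:
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if not any(fields):
            continue  # blank line between sections
        label = fields[0]
        if label in DATA_TYPES:
            data_type, site = label, None  # a new section starts
        elif label.startswith("Binding Site"):
            site = label  # a new table starts within the current section
        elif len(fields) == len(TABLE_COLUMNS) + 1:
            row = {"Data type": data_type, "Binding site name": site}
            row.update(zip(TABLE_COLUMNS, fields[1:]))
            row["Confidence"] = float(row["Confidence"])
            rows.append(row)
    return rows


# Made-up sample content that follows the scheme above (not real ProBis output)
sample = (
    "Type\tMolecule Name\tResidue Name\tSource\tConfidence\tBinder\n"
    "\n"
    "small molecules\n"
    "Binding Site 1\n"
    "\tEXAMPLE LIGAND\tAZP\t2a5i\t3.97\tSpecific\n"
    "\n"
    "ion\n"
    "Binding Site 1\n"
    "\tEXAMPLE ION\tZN\t2a5i\t2.5\tSpecific\n"
)

rows = parse_ligand_text(io.StringIO(sample))
print(len(rows))             # 2
print(rows[0]["Data type"])  # small molecules
```

Passing `rows` to `pd.DataFrame` would give a flat table with `Data type` and `Binding site name` columns in front of the five table columns. Carrying the current section and binding site along as state keeps the parser to a single pass over the file.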

### Parse and save data

In [5]:
ligand_table = parse_ligand_tables(LIGAND_PATH)
ligand_table.to_csv(f'{LIGAND_PATH.parent / LIGAND_PATH.stem}_clean.csv', index=False)

### Load data

In [6]:
ligand_table = pd.read_csv(f'{LIGAND_PATH.parent / LIGAND_PATH.stem}_clean.csv')
ligand_table.head()

Unnamed: 0,Data type,Binding site name,Molecule Name,Residue Name,Source,Confidence,Binder
0,small molecules,Binding Site 1,"N-((3S,6R)-6-((S,E)-4-ETHOXYCARBONYL-1-((S)-2-...",CY6,2alv,3.97,Specific
1,small molecules,Binding Site 1,"ETHYL (5S,8S,11R)-8-BENZYL-5-(3-TERT-BUTOXY-3-...",G82,3tiu,3.97,Specific
2,small molecules,Binding Site 1,"(5S,8S,14R)-ETHYL 11-(3-AMINO-3-OXOPROPYL)-8-B...",AZP,2a5i,3.97,Specific
3,small molecules,Binding Site 1,"(5S,8S,14R)-ETHYL 11-(3-AMINO-3-OXOPROPYL)-8-B...",AZP,2a5i,3.97,Specific
4,small molecules,Binding Site 1,N-[(BENZYLOXY)CARBONYL]-O-TERT-BUTYL-L-THREONY...,ZU5,2zu5,3.97,Specific
