# Overview

This notebook combines the protein chains selected by the PISCES CullPDB server [link] with the secondary structure information provided by DSSP [link].

## Prerequisites

### Cull PDB: PISCES

Proteins from the PDB can be queried based on criteria such as resolution, sequence identity, etc. It's possible (as of 20/03/2018) to download different lists [here](http://dunbrack.fccc.edu/Guoli/pisces_download.php).

### DSSP files

The PISCES lists provide PDB IDs, but they do not include secondary structure information. To get that, you need to download the DSSP files for the PDB entries. This can be done directly [here](http://swift.cmbi.ru.nl/gv/dssp/).

Once the database has been synced locally, the individual \*.dssp files can be parsed with the script [here](https://gist.github.com/dillondaudert/94785e9cc0318ac69243c6283da3a032).
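
For orientation, here is a minimal sketch of what parsing one file involves. It is not the linked script. It assumes the classic fixed-width DSSP layout, in which residue records follow a header line starting with `  #  RESIDUE`, the amino acid is at character index 13, and the secondary structure class is at index 16. The function name `parse_dssp_text`, the helper `_record`, and the `-` placeholder for unassigned residues are illustrative choices, and the sketch ignores chain boundaries and the lower-case letters DSSP uses for bridged cysteines.

```python
def parse_dssp_text(text):
    """Return (sequence, secondary_structure) from the text of one .dssp file."""
    seq, ss = [], []
    in_residues = False
    for line in text.splitlines():
        if line.startswith("  #  RESIDUE"):
            in_residues = True  # residue records start after this header line
            continue
        if not in_residues or len(line) < 17:
            continue
        aa = line[13]
        if aa == "!":  # chain-break marker, not a residue
            continue
        seq.append(aa)
        # A blank structure column means no class was assigned; "-" is our placeholder
        ss.append(line[16] if line[16] != " " else "-")
    return "".join(seq), "".join(ss)


def _record(i, aa, ss):
    # Synthetic record laid out on the assumed columns (not real DSSP output)
    return f"{i:5d}{i:5d} A {aa}  {ss}"


sample = "\n".join([
    "==== header lines before the residue table are skipped ====",
    "  #  RESIDUE AA STRUCTURE",
    _record(1, "M", " "),
    _record(2, "K", "E"),
    f"{3:5d}" + " " * 8 + "!" + " " * 3,  # chain break
    _record(4, "V", "H"),
])
seq, ss = parse_dssp_text(sample)
print(seq, ss)
```

To parse a real file, read it with `Path(...).read_text()` and pass the result to the function.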

### Next Steps
The rest of this notebook assumes that a list downloaded from PISCES and one or more .csv files containing the parsed DSSP data exist in a `data/dssp` folder under the home directory.

## Loading the Data

Essentially, we want to do a join on the PDB ID field of the PISCES and DSSP datasets. Since both are delimited text files (whitespace-separated and CSV, respectively), pandas is well suited to this.
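
As a small illustration of the intended join, the sketch below uses two toy tables. The IDs, sequences, and the `length` column are made up for the example; only the column names `IDs`, `dssp_id`, `seq`, and `ss` are taken from the cells in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented)
pisces = pd.DataFrame({"IDs": ["1ABCA", "2XYZB"], "length": [3, 4]})
dssp = pd.DataFrame({
    "dssp_id": ["1abc", "2xyz", "3def"],
    "seq": ["MKV", "GAVL", "PPG"],
    "ss": ["-EH", "-HH-", "---"],
})

# Derive the join key: first four characters of the chain ID, lower-cased
pisces["dssp_id"] = pisces["IDs"].str[:4].str.lower()

# An inner join keeps only the DSSP rows whose ID appears in the PISCES list
joined = dssp.merge(pisces, how="inner", on="dssp_id")
print(joined[["dssp_id", "seq", "ss"]])
```

The entry `3def` is dropped because it has no counterpart in the toy PISCES table.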

In [None]:
import pandas as pd
from pathlib import Path
datadir = str(Path(Path.home(), "data", "dssp"))

In [None]:
# The PISCES list is whitespace-delimited; sep=r"\s+" replaces the deprecated delim_whitespace=True
cpdb_df = pd.read_csv(datadir+"/cullpdb_pc30_res2.5_R1.0_d180208_chains15102.txt", sep=r"\s+")
print(cpdb_df.columns)
print(cpdb_df.iloc[0])

In [None]:
# Check some assumptions about this data
# Every ID should be 5 characters: a 4-character PDB ID plus a 1-character chain ID
all_length_5 = all(cpdb_df["IDs"].str.len() == 5)
cpdb_long_ids_df = cpdb_df["IDs"][cpdb_df["IDs"].str.len() != 5]
print(all_length_5, len(cpdb_long_ids_df))
# All IDs are unique
unique_ids = cpdb_df["IDs"].is_unique
print(unique_ids)

In [None]:
dssp_1_df = pd.read_csv(datadir+"/dssp_1.csv")
print(dssp_1_df.columns)
print(dssp_1_df["dssp_id"][0], dssp_1_df["seq"][0])

In [None]:
# Check some assumptions about this data
all_length_4 = all(dssp_1_df["dssp_id"].str.len() == 4)
unique_ids = dssp_1_df["dssp_id"].is_unique
print(all_length_4, unique_ids)

## Joining the Data

We want to combine the two datasets by joining on their IDs. Since the cpdb entries are a subset of the DSSP entries, we join on the cpdb ID field after reducing it to its first 4 characters (lower-cased to match the DSSP IDs).

In [None]:
# get a series consisting of only the first 4 characters (lower-cased) of each cpdb id
cpdb_ids = cpdb_df["IDs"].str[0:4].str.lower()
print(cpdb_ids.loc[0:5])
# check if these are still unique
print(cpdb_ids.is_unique)

# get the cpdb dataset with one row per unique 4-letter ID
# (deduplicate on the 4-letter ID, not on the original 5-character chain ID)
cpdb_df_unique = cpdb_df.assign(dssp_id=cpdb_ids).drop_duplicates(subset="dssp_id")
# drop returns a new dataframe, so the result has to be assigned back
cpdb_df_unique = cpdb_df_unique.drop(columns=["IDs"])
print(cpdb_df_unique["dssp_id"].is_unique)

### Note on CPDB chains and DSSP ids
The PISCES server evaluates each chain of a PDB entry individually. As such, the cpdb IDs may contain all or only some of the chains of a particular PDB entry. DSSP, on the other hand, produces a single file / entry per PDB ID, which will include (*I assume*) all of the chains for that entry.

For simplicity, I will assume that if a particular 4-letter PDB entry appears at least once in the cpdb IDs, then the entire entry is acceptable to put in the dataset. This means that I can just take the unique 4-letter IDs from the cpdb_df and do an inner join with the dssp_dfs. Once this is done for all the dssp files, the resulting dataframe will contain the sequences, secondary structure, and other features of the PDB entries specified by the PISCES cull pdb list.
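
To see how much this simplification matters, it can help to count how many chains each entry contributes before collapsing to 4-letter IDs. The sketch below uses made-up chain IDs; the variable names are illustrative.

```python
import pandas as pd

# Made-up chain IDs: entry 1abc contributes two chains, 2xyz one
chain_ids = pd.Series(["1ABCA", "1ABCB", "2XYZA"], name="IDs")

# Collapse each chain ID to its lower-cased 4-letter entry ID
entry_ids = chain_ids.str[:4].str.lower()

# Number of listed chains per entry, and the entries with more than one
chains_per_entry = entry_ids.value_counts()
multi_chain = chains_per_entry[chains_per_entry > 1]

# The unique entry IDs are what the inner join actually uses
unique_entries = entry_ids.drop_duplicates()
print(chains_per_entry)
print(list(unique_entries))
```

In the notebook, the same steps can be applied to `cpdb_df["IDs"]` in place of the toy series.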

In [None]:
# merge the two datasets
merged_1 = dssp_1_df.merge(cpdb_df_unique, how="inner", on="dssp_id")
print(merged_1.columns)
print(len(merged_1))

In [None]:
# Now do this for all dssp files
merged_frames = []
for i in range(1,12):
    dssp_df = pd.read_csv(datadir+"/dssp_%d.csv" % i)
    merged = dssp_df.merge(cpdb_df_unique, how="inner", on="dssp_id")
    print(len(merged))
    merged_frames.append(merged)

In [None]:
all_merged = pd.concat(merged_frames)
print(len(all_merged))
print(all_merged["dssp_id"].is_unique)
# Write out the sequence and secondary structure to a file
all_merged[["dssp_id", "seq", "ss"]].to_csv(datadir+"/cpdb_dssp_%d.csv" % len(all_merged), index=False)
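
Because the merge is an inner join, any cpdb entry without a parsed DSSP record is dropped silently. A quick coverage check such as the sketch below can list those entries. The helper name `unmatched_ids` and the toy IDs are illustrative.

```python
import pandas as pd


def unmatched_ids(wanted, found):
    """Return the sorted IDs present in `wanted` but absent from `found`."""
    return sorted(set(wanted) - set(found))


# Toy data: 4ghi was requested but has no DSSP record
wanted = pd.Series(["1abc", "2xyz", "4ghi"])
found = pd.Series(["1abc", "2xyz"])
missing = unmatched_ids(wanted, found)
print(missing)
```

In the notebook this would be called as `unmatched_ids(cpdb_df_unique["dssp_id"], all_merged["dssp_id"])`.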

## Creating .TFRecords files