# Overview

This repo is a walkthrough for creating a dataset of protein sequences, following the approach of [Zhou & Troyanskaya, 2014](https://arxiv.org/pdf/1403.1347.pdf). The authors of that paper used a dataset constructed as outlined below to train convolutional generative stochastic networks for protein secondary structure prediction (PSSP). The modest goal here is to introduce people with experience in machine/deep learning to the basics of structural bioinformatics by constructing a dataset for PSSP and related tasks.

### Packages Used

- Python 3
- Pandas
- BioPython
- TensorFlow

## Proteins - primary sequence and secondary structure

Proteins are one of the most important classes of macromolecules in biology. They are composed of linear chains of 20 (sometimes 22) proteinogenic amino acids. Each amino acid can be represented by a single letter, allowing us to express proteins as a string. For example, the sequence for hemoglobin is:
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

When found in the appropriate biological context, proteins assume - through a complex interplay of physicochemical forces - a 3-dimensional conformation, referred to as the protein's *tertiary* or *native* structure. This native state determines how a particular protein behaves; that is, form determines function. Accurately predicting tertiary structure is therefore the first and most fundamental step for understanding the function of discovered proteins as well as for designing new proteins with entirely novel functions.

During the folding process, regions along the protein backbone assume local conformations called secondary structures. These secondary structures are characterized by the patterns of hydrogen bonds formed between amino acids in a particular region. One common way of categorizing such structures is given by the dictionary of protein secondary structures ([DSSP](https://swift.cmbi.umcn.nl/gv/dssp/)). Since nearly all secondary structures are localized to a particular region of the backbone, they can also be written as a string of letters, where each letter corresponds to a particular secondary structure. Hemoglobin, aligned with its secondary structures, is as follows:

>       VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH GSAQVKGHGK KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL LSHCLLVTLA AHLPAEFTPA VHASLDKFLA SVSTVLTSKY R
>          HHHHHHH HHHHHHHGGG HHHHHHHHHH HHHHH GGGG GG TTS  ST T HHHHHHHH HHHHHHHHHH HTTTSHHHHT HHHHHHHHHT T   THHHHH HHHHHHHHHH HH TTT  HH HHHHHHHHHH HHHHHHHTT
           
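Because both strings are indexed by residue position, they can be processed in lockstep. The sketch below is illustrative only: it takes the first 20 residues of the alignment above, treats a space as a position with no assigned structure, and collapses the label string into contiguous segments. The function name `ss_segments` is made up for this example.

```python
from itertools import groupby

# First 20 residues of the hemoglobin example and their DSSP labels.
# A space marks a position with no assigned secondary structure.
seq = "VLSPADKTNVKAAWGKVGAH"
ss = "   " + "H" * 14 + "GGG"


def ss_segments(labels):
    """Collapse a per-residue label string into (label, start, end) runs.

    `start` is inclusive and `end` is exclusive, both 0-based.
    """
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, pos, pos + length))
        pos += length
    return segments


assert len(seq) == len(ss)  # one label per residue
for label, start, end in ss_segments(ss):
    print(repr(label), seq[start:end])
```

Each printed line pairs a structure label with the stretch of residues it covers, which is a convenient way to sanity-check that a sequence and its labels are still aligned.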

## Combining Cull PDB and DSSP

The Protein Data Bank ([PDB](https://www.rcsb.org/)) is an online repository of hundreds of thousands of proteins whose 3D structures have been resolved experimentally. We will use the web server CullPDB: PISCES to extract a diverse set of proteins from the PDB, and then use DSSP to find their secondary structures.

### Cull PDB: PISCES

Proteins from the PDB can be queried based on criteria such as resolution, sequence identity, etc. It's possible (as of 20/03/2018) to download different lists [here](http://dunbrack.fccc.edu/Guoli/pisces_download.php).

### DSSP files

The PISCES lists provide PDB IDs, but they do not include secondary structure information. To get that, we need to download the DSSP information for the PDB entries. This can be done directly [here](http://swift.cmbi.ru.nl/gv/dssp/).

By syncing the database locally, the individual \*.dssp files can be parsed by the script [parse_dssp.py](./parse_dssp.py).

### Note on CPDB chains and DSSP ids
The PISCES server checks each chain of a PDB entry individually. As such, the cpdb IDs may cover all or only some of the chains of a particular PDB entry. DSSP, on the other hand, outputs a single file per PDB ID, which includes all of the chains for that entry.

The `parse_dssp.py` script splits DSSP files into their constituent chains, appending the chain ID to the end of the `dssp_id` and creating one record per chain. There are some edge cases where the parser incorrectly identifies the chain ID; those records are skipped.
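The chain-splitting step can be pictured as grouping per-residue records by chain. This is not the code from `parse_dssp.py`, only a minimal sketch: the input tuples, the example ID `1abc`, and the function name `split_chains` are all hypothetical, and a real DSSP file would first need to be parsed into such tuples.

```python
def split_chains(pdb_id, residues):
    """Group (chain, amino_acid, ss_label) tuples into one record per chain.

    Returns a list of dicts with keys 'dssp_id', 'seq', and 'ss', where
    'dssp_id' is the PDB ID with the chain ID appended.
    """
    chains = {}  # dicts preserve insertion order, so chains stay in file order
    for chain, aa, ss in residues:
        seq_parts, ss_parts = chains.setdefault(chain, ([], []))
        seq_parts.append(aa)
        ss_parts.append(ss)
    return [
        {"dssp_id": pdb_id + chain, "seq": "".join(s), "ss": "".join(t)}
        for chain, (s, t) in chains.items()
    ]


# Hypothetical parsed residues from a two-chain entry.
residues = [
    ("A", "V", " "), ("A", "L", "H"), ("A", "S", "H"),
    ("B", "M", "E"), ("B", "K", "E"),
]
records = split_chains("1abc", residues)
for record in records:
    print(record)
```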

### Next Steps
The rest of this notebook will assume that a list downloaded from PISCES as well as some number of .csv files containing the parsed DSSP data exist in a `data/dssp` folder.

## Joining the Data

We want to join the PISCES and DSSP datasets on their PDB ID fields. Since both are in either tab-separated or CSV format, Pandas is a natural tool for the job.

Because the cpdb data is a subset of the dssp data, we join on the cpdb ID field. See [merge_data.py](./merge_data.py) for more details.

In the code below, we drop sequences below 26 residues in length and save the results of the merge to the file `cpdb2_records.csv`.

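```python
import pandas as pd
from pathlib import Path

from merge_data import merge

# Directory holding the PISCES list and the parsed DSSP .csv files
datadir = Path.home() / "data" / "dssp"

# Join the PISCES and DSSP records (see merge_data.py)
merged = merge()

# Keep only sequences of at least 26 residues, then save the result
merged = merged[merged.seq.str.len() > 25]
merged.to_csv(datadir / "cpdb2_records.csv")
```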

## Generate the Position-Specific Scoring Matrices (PSSMs)
Similar to CPDB, we can calculate position-specific profile similarity scores using PSI-BLAST. The process is as follows:

### CPDB2 to FASTA format
In order to make calculating the PSSMs amenable to the CPDB2 dataset, we save the `dssp_id` and `seq` fields in FASTA file format.
Some of these records contain leftover `!` gap symbols, so these are replaced with `*` to indicate a gap of indeterminate length in the sequence, as described [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp).

See the [csv_to_fasta.py](./csv_to_fasta.py) file.
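For orientation, the conversion can be sketched in a few lines of standard-library Python. This is not the contents of `csv_to_fasta.py`: the function name `records_to_fasta`, the 60-character line width, and the two-record sample CSV are assumptions made for illustration. Only the `dssp_id` and `seq` field names and the `!` to `*` replacement come from the text above.

```python
import csv
import io


def records_to_fasta(rows, width=60):
    """Format rows with 'dssp_id' and 'seq' keys as FASTA text.

    Leftover '!' gap symbols are replaced with '*', and each sequence is
    wrapped at `width` characters per line.
    """
    lines = []
    for row in rows:
        seq = row["seq"].replace("!", "*")
        lines.append(">" + row["dssp_id"])
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical two-record CSV standing in for cpdb2_records.csv.
csv_text = "dssp_id,seq\n1abcA,VLSPAD!KTNV\n1abcB,MKAAW\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
fasta = records_to_fasta(rows)
print(fasta)
```

In practice the rows would come from the merged CSV on disk and the returned text would be written to a `.fasta` file.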

### Download and preprocess UniRef90
Download the [UniRef90](http://www.uniprot.org/downloads) dataset, and filter using [pfilt](http://bioinf.cs.ucl.ac.uk/psipred/) to remove low information content and coiled-coil regions.

#### pfilt
See the [README](http://bioinfadmin.cs.ucl.ac.uk/downloads/pfilt/).


### Download BLAST+ and create a local database
Create a BLAST database out of the filtered sequences in FASTA format using the blast command line tool, described [here](https://www.ncbi.nlm.nih.gov/books/NBK279688/). The BLAST+ software is available [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).

### Run PSI-BLAST to get the scores
Run PSI-BLAST on the CullPDB dataset downloaded via PISCES against the BLAST database just created, with an inclusion threshold of 0.001 for 3 iterations:

```bash
psiblast -db uniref90_filt_db -query example.fasta -out output.psiblast -evalue 0.001 -num_threads 4 -num_iterations 3 -out_pssm output.pssm -out_ascii_pssm asci_test.pssm -save_pssm_after_last_round
```

Note that `-evalue` sets the threshold for reported hits; PSI-BLAST's inclusion threshold for building the profile is set by the separate `-inclusion_ethresh` option.

Then transform the profile scores into the range (0, 1) via the logistic sigmoid.
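The logistic transform of the profile scores can be sketched as follows. The example row of scores is made up, and reading scores out of the ASCII PSSM file is not shown; the function is a numerically stable form of `1 / (1 + exp(-x))`.

```python
import math


def sigmoid(score):
    """Logistic function 1 / (1 + exp(-score)), computed without overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


# Hypothetical PSSM row: one integer log-odds score per amino acid column.
pssm_row = [-3, -1, 0, 2, 7]
scaled = [sigmoid(s) for s in pssm_row]
print([round(v, 3) for v in scaled])
```

A score of 0 maps to 0.5, negative scores fall below it, and positive scores rise above it, so the ordering of the original scores is preserved.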

**Note** that PSI-BLAST will only save the PSSM for a single protein at a time. Each new profile overwrites the previous one, so the sequences need to be run independently.
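One way to work within this constraint is to split the multi-record FASTA file into single-record queries and build one command per query. The sketch below only constructs the argument lists and does not invoke PSI-BLAST. The flags are a subset of those in the command above, while the per-record file naming scheme and the function names `split_fasta` and `psiblast_command` are assumptions.

```python
def split_fasta(text):
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    record_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            record_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def psiblast_command(record_id, db="uniref90_filt_db"):
    """Build the argument list for one single-sequence PSI-BLAST run."""
    return [
        "psiblast", "-db", db,
        "-query", record_id + ".fasta",          # assumed: one query file per record
        "-out", record_id + ".psiblast",
        "-evalue", "0.001",
        "-num_threads", "4",
        "-num_iterations", "3",
        "-out_ascii_pssm", record_id + ".pssm",  # assumed: one PSSM file per record
        "-save_pssm_after_last_round",
    ]


fasta_text = ">1abcA\nVLSPAD\nKTNV\n>1abcB\nMKAAW\n"  # hypothetical input
records = list(split_fasta(fasta_text))
commands = [psiblast_command(record_id) for record_id, _ in records]
print(commands[0])
```

Each argument list could then be passed to `subprocess.run` once the corresponding single-record query file has been written.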

# Creating a dataset without PSIBLAST

## Creating .TFRecords files

We could use the string pairs directly as inputs to a learning model. The TensorFlow `tf.data` API allows for reading text data and converting it to feature vectors as a preprocessing step in a model. However, doing the conversion on the fly creates a CPU bottleneck that could slow down training. Since the dataset is only ~14k sequences, we can instead save them directly as feature vectors in a .tfrecords file that is loaded into memory at training time.

## Features

Position-specific features are often calculated using PSI-BLAST or hidden Markov models. To keep things simple at this stage, we can simply append feature vectors that correspond to the amino acids / secondary structures of the sequence and save those as TF records. The features are the following:

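```python
aa_feats = pd.read_csv("./cpdb2_aa_features.csv", index_col=0)
aa_feats
```

| | A | C | D | E | F | G | H | I | K | L | ... | X | ! | SOS | EOS | hydrophobicity | polar | hydropathy intensity | hydrophilicity | pH_l | vdW_vol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.8 | 3.0 | 6.01 | 67.0 |
| C | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.0 | 2.5 | -1.0 | 5.05 | 86.0 |
| D | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | -3.5 | 3.0 | 2.85 | 91.0 |
| E | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | -3.5 | 3.0 | 3.15 | 109.0 |
| F | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.0 | 2.8 | -2.5 | 5.49 | 135.0 |
| G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.4 | 0.0 | 6.06 | 48.0 |
| H | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | -3.2 | -0.5 | 7.6 | 118.0 |
| I | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.0 | 4.5 | -1.8 | 6.05 | 124.0 |
| K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | -3.9 | 3.0 | 9.6 | 135.0 |
| L | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | -1.0 | 3.8 | -1.8 | 6.01 | 124.0 |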


```python
ss_feats = pd.read_csv("./cpdb2_ss_features.csv", index_col=0)
ss_feats
```

| labels | H | B | E | G | I | T | S | U | SOS | EOS |
|---|---|---|---|---|---|---|---|---|---|---|
| H | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| B | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| E | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| G | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| I | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| T | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| U | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| SOS | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| EOS | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


The secondary structures are the targets, and so their "features" are simply one-hot encodings of the letters, including special SOS/EOS tokens.
The amino acid features include a one-hot encoding of the 20 proteinogenic amino acids (plus the special symbols shown in the table above, such as `X`, `!`, `SOS`, and `EOS`) as well as six physicochemical properties: hydrophobicity, polarity, hydropathy intensity, hydrophilicity, isoelectric point, and van der Waals volume.
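To make the layout of a per-residue feature vector concrete, here is a toy sketch that concatenates a one-hot encoding with the six property values. It uses a hypothetical three-letter alphabet rather than the full symbol set, and the function names are made up; the property values are copied from the `A`, `C`, and `G` rows shown above.

```python
# Toy alphabet; the real feature table covers every symbol in cpdb2_aa_features.csv.
ALPHABET = ["A", "C", "G"]

# Property values copied from the feature table, in the order:
# hydrophobicity, polar, hydropathy intensity, hydrophilicity, pH_l, vdW_vol
PROPERTIES = {
    "A": [0.0, 0.0, 1.8, 3.0, 6.01, 67.0],
    "C": [1.0, -1.0, 2.5, -1.0, 5.05, 86.0],
    "G": [0.0, 0.0, -0.4, 0.0, 6.06, 48.0],
}


def encode_residue(aa):
    """Return the one-hot vector for `aa` followed by its six properties."""
    one_hot = [1.0 if aa == symbol else 0.0 for symbol in ALPHABET]
    return one_hot + PROPERTIES[aa]


def encode_sequence(seq):
    """Encode a sequence as a list of per-residue feature vectors."""
    return [encode_residue(aa) for aa in seq]


features = encode_sequence("GAC")
print(len(features), len(features[0]))  # number of residues, features per residue
```

With the full table, the same lookup amounts to selecting one row of `aa_feats` per residue.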

See the [make_tfrecords.py](./make_tfrecords.py) script for how the protein sequences are processed into tf records.

These files are read into TensorFlow models using the tf.data API.

### Note on cpdb2 sequences with the characters 'b', 'j', 'o', 'u', and 'z'
According to the description of DSSP, lowercase characters indicate an SS-bridge cysteine. These come in pairs. Only some of them show up as bad characters, because the rest become valid amino acid codes when capitalized.

The strategy for these is thus to replace them with 'C' for cysteine.
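A minimal sketch of that replacement, assuming every lowercase letter in a sequence is an SS-bridge cysteine label (the function name and the sample string are made up for illustration):

```python
import re


def fix_bridge_cysteines(seq):
    """Replace every lowercase letter (an SS-bridge cysteine label) with 'C'."""
    return re.sub(r"[a-z]", "C", seq)


print(fix_bridge_cysteines("VLaPADKbNVa"))  # prints VLCPADKCNVC
```

Matching all lowercase letters, rather than only the five that surface as bad characters, also catches the labels that would otherwise be mistaken for valid residues after capitalization.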