# Assessment of Secondary Structure Prediction

Using the secondary structure annotation generated by MOE for the PDB structure 1XP0 as the reference, assess the quality of the PSIPRED and JPred predictions.


In [1]:
moe     = """--HHHHHHHHHHH-----HHHHTTT-TT---TT--HHHHHHHHHHHHHHTTHHHHTT--HHHHHHHHHHHHHT--TT-----HHHHHHHHHHHHHHHHTT--HHH--HHHHHHHHHHHHHTTTT-----HHHHHHTT-HHHHHTTT--HHHHHHHHHHHHHHT-TT--TTTT--HHHHHHHHHHHHHHHHHT-HHHHHHHHHHHHHHHHTT---TT-HHHHHHHHHHHHHHHHTHHHH--HHHHHHHHHHHHHHHHHHHHHHHHHT-----HHH-HHHHHHHHHHHHHHHHHTTHHHHHHHHHH-HHHHHHHHHHHHHHHHHHHHT--"""
psipred = """----HHHHHHHHH-----------------------HHHHHHHHHHHHHH---------HHHHHHHHHHH-------------HHHH-HHHHHHHHHH---HHHHHHHHHHHHHHHHHHHH---------HHHHH---HHHHH-----HHHHHHHHHHHHHH------------HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH---------HHHHHHHHHHHHHH--------HHHHHHHHHHHHHH----HHHHHH---------------------HHHHHHHHHHHHHHHHHH----HHHHHHHHHHHHHHHHHHHH"""
jpred   = """-E--HHHHHHH-------HHH----------------HHHHHHHHHHHH----------HHHHHHHHHHHHHH---------HHHHHHHHHHHHHHHHH---------HHHHHHHHHHHHH---------HHHHHH---HHHH-----HHHHHHHHHHHHHH------------HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-----------HHHHHHHHHHH---------HHHHHHHHHHHHHHHHHHHHHHHHH-------------------HHHHHHHHHHHHHHHHHHHH---HHHHHHHHHHHHHHHHHHHHH""" 

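# Collapse to two states: MOE turns (T) are treated as coil (-)
moe = moe.replace("T", "-")
# The single strand residue (E) predicted by JPred is counted as helix (H)
jpred = jpred.replace("E", "H")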

## Analysis of Data

Because PDE5A's catalytic domain is mostly helices, the sensitivity and specificity calculations treat coil as the negative class and helix as the positive class, with MOE's annotation as the reference. MOE's turn annotations (T) were therefore collapsed into coil, and the single beta-strand residue predicted by JPred (E) was relabelled as helix so that it counts as a false positive.

In [2]:
p_fn, p_fp, p_tp, p_tn = 0, 0, 0, 0
j_fn, j_fp, j_tp, j_tn = 0, 0, 0, 0

assert len(moe) == len(psipred) == len(jpred)

for m, p, j in zip(moe, psipred, jpred):
    # Statistics over the PSIPRED Data
    if m == 'H' and p == '-':
        p_fn += 1
    if m == '-' and p == 'H':
        p_fp += 1
    if m == 'H' and p == 'H':
        p_tp += 1
    if m == '-' and p == '-':
        p_tn += 1   
    
    # Statistics over the JPred Data
    if m == 'H' and j == '-':
        j_fn += 1
    if m == '-' and j == 'H':
        j_fp += 1
    if m == 'H' and j == 'H':
        j_tp += 1
    if m == '-' and j == '-':
        j_tn += 1
        
print("PSIPRED tp {} tn {} fp {} fn {}".format(p_tp, p_tn, p_fp, p_fn))    
print("JPRED   tp {} tn {} fp {} fn {}".format(j_tp, j_tn, j_fp, j_fn))

PSIPRED tp 187 tn 84 fp 13 fn 44
JPRED   tp 191 tn 85 fp 12 fn 40
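The same tally can be written once and reused for both predictors. The following is a minimal sketch, not the code that produced the numbers above: the helper name `confusion_counts` and the toy strings are hypothetical, and unlike the loop above it treats every non-helix character as a negative.

```python
from collections import Counter


def confusion_counts(reference, predicted, positive="H"):
    """Tally tp/tn/fp/fn per residue, treating `positive` as the positive class.

    Assumption: any character other than `positive` counts as a negative.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and prediction must have the same length")
    counts = Counter()
    for ref, pred in zip(reference, predicted):
        if ref == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts


# Toy strings for illustration only (not taken from 1XP0)
toy_ref = "--HHHH--"
toy_pred = "-HHHH---"
toy = confusion_counts(toy_ref, toy_pred)
print("toy tp {} tn {} fp {} fn {}".format(toy["tp"], toy["tn"], toy["fp"], toy["fn"]))
```

With the notebook's variables, `confusion_counts(moe, psipred)` and `confusion_counts(moe, jpred)` would replace the two groups of `if` statements.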


## Sensitivity

Sensitivity measures the proportion of actual positives that are correctly identified. In this example, positives are helices. It is calculated by TP / P = TP / (TP + FN).

In [3]:
p_sens = p_tp / (p_tp + p_fn)
j_sens = j_tp / (j_tp + j_fn)

print("PSIPRED Sensitivity: {:.2%}".format(p_sens))
print("JPRED   Sensitivity: {:.2%}".format(j_sens))

PSIPRED Sensitivity: 80.95%
JPRED   Sensitivity: 82.68%


## Specificity
Specificity measures the proportion of actual negatives that are correctly identified. In this example, negatives are coils. It is calculated by TN / N = TN / (TN + FP).

In [4]:
p_spec = p_tn / (p_tn + p_fp)
j_spec = j_tn / (j_tn + j_fp)

print("PSIPRED Specificity: {:.2%}".format(p_spec))
print("JPRED   Specificity: {:.2%}".format(j_spec))

PSIPRED Specificity: 86.60%
JPRED   Specificity: 87.63%


## F1 Score
F1 score is the harmonic mean of precision, TP / (TP + FP), and sensitivity (also called recall), so it does not take true negatives into account. It gives a single summary value that is easy to report. It is calculated by 2TP / (2TP + FP + FN).

In [5]:
p_f1 = 2 * p_tp / (2 * p_tp + p_fp + p_fn)
j_f1 = 2 * j_tp / (2 * j_tp + j_fp + j_fn)

print("PSIPRED F1: {:.2%}".format(p_f1))
print("JPRED   F1: {:.2%}".format(j_f1))

PSIPRED F1: 86.77%
JPRED   F1: 88.02%
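To make the link between the count formula and the harmonic mean explicit, the sketch below recomputes F1 from precision and recall using the counts printed earlier. The function and variable names are illustrative and not part of the original notebook.

```python
def precision(tp, fp):
    """Fraction of predicted helix residues that are helix in the reference."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Same quantity as sensitivity: fraction of reference helix residues found."""
    return tp / (tp + fn)


def f1_from_precision_recall(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Counts copied from the output of the counting cell above
psipred_counts = {"tp": 187, "fp": 13, "fn": 44}
jpred_counts = {"tp": 191, "fp": 12, "fn": 40}

psipred_f1 = f1_from_precision_recall(**psipred_counts)
jpred_f1 = f1_from_precision_recall(**jpred_counts)

print("PSIPRED precision: {:.2%}  F1: {:.2%}".format(precision(187, 13), psipred_f1))
print("JPRED   precision: {:.2%}  F1: {:.2%}".format(precision(191, 12), jpred_f1))
```

Both routes give the same value because 2PR / (P + R) simplifies algebraically to 2TP / (2TP + FP + FN).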


## Conclusions

JPred scored slightly higher than PSIPRED on sensitivity, specificity, and F1 for PDB 1XP0, with margins of roughly one to two percentage points. Both PSIPRED and JPred were successful at identifying the helical regions, but made small errors in the length of the helices.

This analysis is a simple case; for proteins with folds that are not all-alpha, a two-state helix/coil comparison is no longer sufficient and the analysis becomes more complicated. These methods may also perform differently, perhaps better, on proteins with other folds. Ultimately, these results can be contextualized with the results of the WS1516 Structural Bioinformatics class on a wide variety of proteins.
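One possible way to extend the comparison to folds with strands is to keep three states (helix, strand, coil) and report overall per-residue agreement, commonly called Q3, along with a sensitivity for each state. The sketch below is an assumption about how that could look, not something done in this notebook; the function names and toy strings are hypothetical.

```python
def q3_accuracy(reference, predicted):
    """Fraction of residues whose three-state label (H, E or -) agrees."""
    if len(reference) != len(predicted):
        raise ValueError("reference and prediction must have the same length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def per_state_sensitivity(reference, predicted, states="HE-"):
    """One-vs-rest sensitivity for each state; None if the state never occurs."""
    result = {}
    for state in states:
        actual = sum(r == state for r in reference)
        hits = sum(r == state and p == state for r, p in zip(reference, predicted))
        result[state] = hits / actual if actual else None
    return result


# Toy strings for illustration only (not taken from 1XP0)
toy_ref = "--HHHH--EEEE--"
toy_pred = "--HHH---EEE-H-"

toy_q3 = q3_accuracy(toy_ref, toy_pred)
toy_sens = per_state_sensitivity(toy_ref, toy_pred)
print("Q3: {:.2%}".format(toy_q3))
print(toy_sens)
```

For a three-state comparison, the turn and strand relabelling used earlier would need to be replaced by a mapping that keeps strands as their own class.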

## References
* https://en.wikipedia.org/wiki/Sensitivity_and_specificity