# Threshold optimization

The naive classifier based on a BLAST search needs an e-value threshold that minimizes the erroneous assignments between kunitz and non-kunitz proteins.

In [1]:
%%bash
pwd

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj_blast_classification/docs


## Confusion matrix
Create a file with these columns from the BLAST output:

1. identifier
2. e-value
3. class (0 = negative, non-kunitz; 1 = positive, kunitz)

NOTE: the relevant columns in the BLAST output are:

* column 1: query id
* column 2: target id
* \[...\]
* column 11: e-value

## List the minimum e-value for each sequence
First, we need to select the minimum e-value obtained from the BLAST search for each sequence:

In [2]:
%%bash
cd ../dataset/
pwd
snakemake -p sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_min_eval.gz
snakemake -p sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_blast_min_eval.gz
zcat sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_min_eval.gz | head

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj_blast_classification/dataset
sp|A0B688|PRIS_METTP	98
sp|A0KS65|DAPF_SHESA	16
sp|A0PXQ5|PAND_CLONN	128
sp|A0Q3B8|MURI_CLONN	136
sp|A0T0X2|RK3_THAPS	62
sp|A1ANJ2|TRPF_PELPD	33
sp|A1AWT1|SYFA_RUTMC	4.8
sp|A1CHU1|MDM34_ASPCL	27
sp|A1JPP1|ZAPA_YERE8	133
sp|A1KRG4|RL10_NEIMF	56


Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	list_min_evalues
	1

rule list_min_evalues:
    input: sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_run.gz
    output: sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_min_eval.gz
    jobid: 0
    wildcards: blast_run=sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot

for id in $(zcat sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_run.gz | cut -f2 | sort | uniq); do zcat sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_run.gz | grep $id | cut -f2,11 | LC_ALL=c sort -gk2 | sed -n '1p'; done | gzip > sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_o

TODO: check consistency in number of lines
DONE: lines are consistent
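
The shell loop in the rule above re-reads the compressed file once per identifier. As a rough illustration of the same selection done in a single pass, here is a minimal Python sketch. It assumes the tabular layout noted earlier (target id in column 2, e-value in column 11); the function name `min_evalue_per_target` and the sample rows are hypothetical and not part of the project.

```python
def min_evalue_per_target(lines):
    """Return {target_id: smallest e-value} from tab-separated BLAST lines."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip blank or malformed lines
        target, evalue = fields[1], float(fields[10])
        if target not in best or evalue < best[target]:
            best[target] = evalue
    return best


# Hypothetical rows: only column 2 (target id) and column 11 (e-value) matter.
sample = [
    "q1\ttA\t.\t.\t.\t.\t.\t.\t.\t.\t1e-5\t50",
    "q2\ttA\t.\t.\t.\t.\t.\t.\t.\t.\t3e-9\t70",
    "q1\ttB\t.\t.\t.\t.\t.\t.\t.\t.\t4.8\t20",
]
result = min_evalue_per_target(sample)
for target, evalue in sorted(result.items()):
    print(f"{target}\t{evalue:g}")
```

On a real run the lines could come from `gzip.open(path, "rt")` instead of the inline list.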

Then, create the list that feeds the confusion matrix program:

In [3]:
%%bash
cd ../dataset/
pwd
snakemake -p conf_list_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.gz
zcat conf_list_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.gz | head

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj_blast_classification/dataset
sp|A0A1Z0YU59|MAMB1_DENAN	6.83e-15	1
sp|A5X2X1|VKT_SISCA	9.60e-17	1
sp|A6MFL1|VKT1_DEMVE	5.99e-15	1
sp|A6MFL2|VKT2_DEMVE	6.12e-15	1
sp|A6MFL3|VKT3_DEMVE	8.29e-17	1
sp|A6MFL4|VKT4_DEMVE	7.86e-17	1
sp|A6MGX9|VKT5_DEMVE	2.60e-15	1
sp|A6MGY1|VKT7_DEMVE	8.29e-17	1
sp|A7X3V4|VKT1_TELDH	7.26e-16	1
sp|A7X3V7|VKT1_PHIOL	1.18e-16	1


Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	compute_confusion_matrix_feed_file
	1

rule compute_confusion_matrix_feed_file:
    input: sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_blast_min_eval.gz, sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_min_eval.gz
    output: conf_list_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.gz
    jobid: 0
    wildcards: p=sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot, n=sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot

cat <(zcat sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_blast_min_eval
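
The rule's command is truncated in the log above, so the following is only a sketch of the idea, not the rule itself: positives and negatives are concatenated and a class column is appended (1 = kunitz, 0 = non-kunitz). The helper name `label_rows` is hypothetical; the two example rows reuse identifiers and e-values shown in the outputs of this notebook.

```python
def label_rows(rows, cls):
    """Append a class label to each (identifier, e-value) pair."""
    return [(seq_id, evalue, cls) for seq_id, evalue in rows]


positives = [("sp|A5X2X1|VKT_SISCA", 9.60e-17)]   # kunitz
negatives = [("sp|A1AWT1|SYFA_RUTMC", 4.8)]       # non-kunitz

conf_list = label_rows(positives, 1) + label_rows(negatives, 0)
for seq_id, evalue, cls in conf_list:
    print(f"{seq_id}\t{evalue:.2e}\t{cls}")
```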

All the data needed to optimize the e-value threshold, and so make the best possible predictions, has now been generated. The confusion matrix is computed by running the following rules; the final output files contain this information:
* **TP**      number of true positives
* **TN**     number of true negatives
* **FP**      number of false positives
* **FN**      number of false negatives
* **acc**     accuracy
* **tpr**     sensitivity - true positive rate
* **ppv**     precision - positive predictive value
* **mcc**     Matthews correlation coefficient
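
For reference, the four derived scores follow from the four counts using the standard definitions. The sketch below is an illustration only and is not the code of `buildConfMatrix.py`; the function name `metrics` and the zero-denominator fallback are assumptions.

```python
import math


def metrics(tp, tn, fp, fn):
    """Compute acc, tpr, ppv and mcc from confusion matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumption: report 0 when undefined

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "tpr": ratio(tp, tp + fn),
        "ppv": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Counts reported in this notebook for the 0.001 threshold.
m = metrics(tp=339, tn=495, fp=5, fn=1)
for name, value in m.items():
    print(f"{name}\t{value:.6f}")
```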

TODO: Analysis of the area under the ROC should be implemented

In [4]:
%%bash
cd ../dataset/
pwd
snakemake -p conf_mat_th0.001_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot
cat conf_mat_th0.001_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj_blast_classification/dataset
TP	339
TN	495
FP	5
FN	1
acc	0.992857
tpr	0.997059
ppv	0.985465
mcc	0.985252


Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	compute_confusion_matrix
	1

rule compute_confusion_matrix:
    input: conf_list_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.gz
    output: conf_mat_th0.001_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_and_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot
    jobid: 0
    wildcards: th=0.001, p=sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot, n=sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot

python ../src/buildConfMatrix.py conf_list_of_sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_an
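
To choose the threshold, the same counting can be repeated over several candidate values. The sketch below shows one possible way to do that on labelled `(identifier, e-value, class)` rows. It assumes that a sequence is predicted as kunitz when its e-value is strictly below the threshold; the exact comparison used by `buildConfMatrix.py` is not shown here. The function name and the toy rows are hypothetical.

```python
def confusion_counts(rows, threshold):
    """Return (TP, TN, FP, FN) for rows of (identifier, e-value, class)."""
    tp = tn = fp = fn = 0
    for _seq_id, evalue, cls in rows:
        predicted = 1 if evalue < threshold else 0  # assumption: strict '<'
        if predicted == 1 and cls == 1:
            tp += 1
        elif predicted == 0 and cls == 0:
            tn += 1
        elif predicted == 1 and cls == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


# Toy data, not taken from the real dataset.
rows = [
    ("p1", 9.6e-17, 1),
    ("p2", 0.5, 1),     # a kunitz sequence with a weak hit
    ("n1", 4.8, 0),
    ("n2", 1e-4, 0),    # a non-kunitz sequence with a strong hit
]
for threshold in (1e-10, 1e-3, 1.0):
    print(threshold, confusion_counts(rows, threshold))
```

A stricter threshold trades false positives for false negatives, so the scan makes the trade-off visible before a single value is picked.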

# Notes for HMM classification prj
## blastclust 
blastclust -i ids_kunitz.fasta -o ids_kunits.clust -L 0.8 -S 90

clusters together the sequences that share at least 90% sequence identity (`-S 90`) over at least 80% of the sequence length (`-L 0.8`). The output is sorted so that the first line contains the cluster with the greatest number of sequences.

This will be very useful for checking and removing redundancy when considering queries in PDB
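
A minimal sketch of how the cluster file could be used to remove redundancy: keep one representative per cluster. It assumes each output line lists the identifiers of one cluster separated by whitespace; taking the first identifier as the representative is an arbitrary choice. The function name and the sample identifiers are hypothetical.

```python
def cluster_representatives(lines):
    """Return the first identifier of each non-empty cluster line."""
    representatives = []
    for line in lines:
        members = line.split()
        if members:
            representatives.append(members[0])
    return representatives


# Hypothetical cluster file content, largest cluster first.
sample = [
    "1abc_A 2def_B 3ghi_A",
    "4jkl_A 5mno_A",
    "",
    "6pqr_A",
]
reps = cluster_representatives(sample)
print(reps)
```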

## HMM classifier framework
* fetch PDB entries for kunitz and non-kunitz
* check for redundancy and filter
* train hmm with training set
* optimize threshold
* test hmm with different set (do some ten-fold cross-validation)
* do further tests that will be discussed later
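
The ten-fold cross-validation step can be sketched as below: identifiers are shuffled once, dealt into ten folds, and each fold serves in turn as the test set while the other nine form the training set. This is only an illustration using the standard library; the function name `k_fold_splits`, the fixed seed and the placeholder identifiers are assumptions.

```python
import random


def k_fold_splits(ids, k=10, seed=0):
    """Yield (train, test) lists so that every id is tested exactly once."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


ids = [f"seq{i}" for i in range(25)]  # placeholder identifiers
splits = list(k_fold_splits(ids))
for n, (train, test) in enumerate(splits):
    print(n, len(train), len(test))
```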

### fetch PDB files
Use RCSB PDB advanced search; note that you can filter for mutations, co-crystallization...

TODO: use PDBeFold to get a list of structures similar to a selected one

## Report sections
* Datasets
* Model Building
* Optimization
* Performance

## Set function in python to compare lists
Python's built-in `set` type supports intersection, union, difference, etc. It is useful for comparing lists of identifiers.
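
A short example of the set operators on two lists of identifiers; the identifiers are placeholders.

```python
list_a = ["id1", "id2", "id3"]   # e.g. identifiers from one dataset
list_b = ["id2", "id3", "id4"]   # e.g. identifiers from another dataset

a, b = set(list_a), set(list_b)

common = a & b        # intersection: in both lists
either = a | b        # union: in at least one list
only_a = a - b        # difference: in the first list only
only_one = a ^ b      # symmetric difference: in exactly one list

print(sorted(common), sorted(either), sorted(only_a), sorted(only_one))
```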

## Endnotes
This might be useful for software packages: http://lipid.biocomp.unibo.it/emidio/tmp/
Song of the day: Banco de Gaia - Kincajou