# Project Intro
## Exercise

Build a BLAST-based method to predict the presence of the BPTI/Kunitz domain in the proteins available in SwissProt, using the human proteins as a reference.
* Select all proteins in SwissProt with a BPTI/Kunitz domain.
* Separate human from non-human proteins. Use the non-human proteins as the positives in the testing set.
* Generate a random set of negatives of the same size as the positive set.
* Remove both positives and negatives from SwissProt and perform the prediction based on the results of the BLAST search.

I downloaded the following from UniProt:
* **sprot_pf00014_global_id.txt**: identifiers in SwissProt for proteins with a BPTI/Kunitz domain
* **sprot_pf00014_human_id.txt**: identifiers in SwissProt for human proteins with a BPTI/Kunitz domain
* **sprot_pf00014_non_human_id.txt**: identifiers in SwissProt for non-human proteins with a BPTI/Kunitz domain
* **sprot_non_pf00014_global_id.txt**: identifiers in SwissProt for proteins without a BPTI/Kunitz domain
* **swissprot.fasta.gz**: the whole UniProt/Swiss-Prot database in FASTA format

In [1]:
%%bash
cd ../dataset/
pwd
ls -lhtr

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset
total 112M
-rw-rw-r-- 1 alessandro alessandro 112M apr  1 14:48 swissprot.fasta.gz
lrwxrwxrwx 1 alessandro alessandro   23 apr  1 16:44 Snakefile -> ../snakefiles/Snakefile
lrwxrwxrwx 1 alessandro alessandro   43 apr  1 16:46 sprot_non_pf00014_global_id.txt -> ../handmade/sprot_non_pf00014_global_id.txt
lrwxrwxrwx 1 alessandro alessandro   39 apr  1 16:46 sprot_pf00014_global_id.txt -> ../handmade/sprot_pf00014_global_id.txt
lrwxrwxrwx 1 alessandro alessandro   38 apr  1 16:46 sprot_pf00014_human_id.txt -> ../handmade/sprot_pf00014_human_id.txt
lrwxrwxrwx 1 alessandro alessandro   42 apr  1 16:47 sprot_pf00014_non_human_id.txt -> ../handmade/sprot_pf00014_non_human_id.txt


I extracted the sequences in FASTA format from **swissprot.fasta.gz**, using each of the following identifier lists as a filter (one extraction per list):
* sprot_pf00014_global_id.txt
* sprot_pf00014_human_id.txt
* sprot_pf00014_non_human_id.txt
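
The filtering step can be sketched in Python as follows. This is only an illustration of the idea, not the contents of `fastafilter.py`, which are not shown here. It assumes that the identifier lists contain UniProt accessions and that headers follow the UniProt layout `>sp|ACCESSION|NAME ...`; the function names and the demo records are made up.

```python
def accession_of(header):
    """Return the accession of a FASTA header line.

    Assumption: UniProt-style headers ('>sp|P00001|AAA_HUMAN ...'), where the
    accession is the second '|'-separated field. For any other layout the
    first whitespace-separated token is used instead.
    """
    parts = header[1:].split("|")
    if len(parts) >= 3:
        return parts[1]
    tokens = header[1:].split()
    return tokens[0] if tokens else ""


def filter_fasta(lines, wanted):
    """Yield the lines of every FASTA record whose accession is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts: decide whether to keep it.
            keep = accession_of(line) in wanted
        if keep:
            yield line


# Made-up records, only to show the behaviour.
demo = [
    ">sp|P00001|AAA_HUMAN demo protein\n",
    "MKTAYIAK\n",
    ">sp|P00002|BBB_MOUSE demo protein\n",
    "GGSLLV\n",
]
selected = list(filter_fasta(demo, {"P00002"}))
print("".join(selected), end="")
```

On the real data, `lines` would be a handle opened with `gzip.open("swissprot.fasta.gz", "rt")` and `wanted` the set of identifiers read from one of the lists above.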

In [2]:
%%bash
cd ../dataset/
pwd
snakemake -p sprot_pf00014_global_id_filter_on_swissprot.fasta.gz
snakemake -p sprot_pf00014_human_id_filter_on_swissprot.fasta.gz
snakemake -p sprot_pf00014_non_human_id_filter_on_swissprot.fasta.gz
ls -lhtr

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset
total 112M
-rw-rw-r-- 1 alessandro alessandro 112M apr  1 14:48 swissprot.fasta.gz
lrwxrwxrwx 1 alessandro alessandro   23 apr  1 16:44 Snakefile -> ../snakefiles/Snakefile
lrwxrwxrwx 1 alessandro alessandro   43 apr  1 16:46 sprot_non_pf00014_global_id.txt -> ../handmade/sprot_non_pf00014_global_id.txt
lrwxrwxrwx 1 alessandro alessandro   39 apr  1 16:46 sprot_pf00014_global_id.txt -> ../handmade/sprot_pf00014_global_id.txt
lrwxrwxrwx 1 alessandro alessandro   38 apr  1 16:46 sprot_pf00014_human_id.txt -> ../handmade/sprot_pf00014_human_id.txt
lrwxrwxrwx 1 alessandro alessandro   42 apr  1 16:47 sprot_pf00014_non_human_id.txt -> ../handmade/sprot_pf00014_non_human_id.txt
-rw-r--r-- 1 alessandro alessandro  37K apr  1 17:19 sprot_pf00014_global_id_filter_on_swissprot.fasta.gz
-rw-r--r-- 1 alessandro alessandro 8,4K apr  1 17:19 sprot_pf00014_human_id_filter_on_swissprot.fasta.gz
-rw-r--r-- 1 alessandro alessan

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	filterfasta
	1

rule filterfasta:
    input: swissprot.fasta.gz, sprot_pf00014_global_id.txt
    output: sprot_pf00014_global_id_filter_on_swissprot.fasta.gz
    jobid: 0
    wildcards: filter=sprot_pf00014_global_id, db=swissprot

python ../src/fastafilter.py swissprot.fasta.gz sprot_pf00014_global_id.txt | gzip > sprot_pf00014_global_id_filter_on_swissprot.fasta.gz
Finished job 0.
1 of 1 steps (100%) done
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	filterfasta
	1

rule filterfasta:
    input: swissprot.fasta.gz, sprot_pf00014_human_id.txt
    output: sprot_pf00014_human_id_filter_on_swissprot.fasta.gz
    jobid: 0
    wildcards: filter=sprot_pf00014_human_id, db=swissprot

python ../src/fastafilter.py swissprot.fasta.gz sprot_pf00014_human_id.txt | gzip > sprot_pf00014_human_id_filter_on_swissprot.fasta.gz
Finished job 0.
1 of 1 steps (100%) 

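The reference (seed) set is made up of the human sequences: there are 18 of them. The total number of sequences with the domain is 358, so the remaining 340 non-human sequences are the positives of the testing set.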

In [3]:
%%bash
cd ../dataset/
pwd
wc -l sprot_pf00014_human_id.txt
wc -l sprot_pf00014_global_id.txt
wc -l sprot_pf00014_non_human_id.txt

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset
18 sprot_pf00014_human_id.txt
358 sprot_pf00014_global_id.txt
340 sprot_pf00014_non_human_id.txt


Check for consistency between the number of identifiers and the number of sequences extracted into each FASTA file: OK
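
The same check can be sketched in Python, assuming gzipped FASTA files like the ones produced above. The helper names are hypothetical and the demo file is a temporary one with made-up records. Unlike `grep -c '>'`, which counts every line containing `>`, this version counts only lines that start with it.

```python
import gzip
import os
import tempfile


def count_fasta_records(path):
    """Count the records in a gzipped FASTA file (lines starting with '>')."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def is_consistent(n_global, n_human, n_non_human):
    """The human and non-human subsets must add up to the global set."""
    return n_global == n_human + n_non_human


# Demo on a temporary file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta.gz")
    with gzip.open(demo_path, "wt") as handle:
        handle.write(">sp|P00001|AAA_HUMAN\nMKT\n>sp|P00002|BBB_MOUSE\nGGS\n")
    demo_count = count_fasta_records(demo_path)

print(demo_count)
# Counts reported by wc -l on the three identifier lists.
print(is_consistent(358, 18, 340))
```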

In [4]:
%%bash
cd ../dataset/
pwd
zcat sprot_pf00014_global_id_filter_on_swissprot.fasta.gz | grep -c '>'
zcat sprot_pf00014_human_id_filter_on_swissprot.fasta.gz | grep -c '>'
zcat sprot_pf00014_non_human_id_filter_on_swissprot.fasta.gz | grep -c '>'

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset
358
18
340


Database search using the human sequences as seeds:
* sample 500 random identifiers from the non-Kunitz sequences
* extract the FASTA sequences of the sampled set
* index the databases
* search for homology with BLAST
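
The sampling step is done in the workflow with `sort -R | sed -n '1,500p'`. A rough Python equivalent is sketched below; the function name, the fixed seed (added here only to make the draw reproducible) and the demo identifiers are assumptions, not part of the project.

```python
import random


def sample_ids(ids, n, seed=None):
    """Draw `n` distinct identifiers at random, without replacement."""
    unique = sorted(set(ids))  # drop duplicates, fix the order before sampling
    if n > len(unique):
        raise ValueError(f"cannot sample {n} ids from {len(unique)} unique ids")
    rng = random.Random(seed)  # private generator, so the global state is untouched
    return rng.sample(unique, n)


# Made-up identifiers standing in for sprot_non_pf00014_global_id.txt.
demo_ids = [f"ID{i:05d}" for i in range(2000)]
picked = sample_ids(demo_ids, 500, seed=42)
print(len(picked), len(set(picked)))
```

Sampling without replacement guarantees that no negative example appears twice in the set.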

In [5]:
%%bash
cd ../dataset/
pwd
snakemake -p sprot_non_pf00014_global_id_sampled500.txt
snakemake -p sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.fasta.gz
snakemake -p sprot_pf00014_non_human_id_filter_on_swissprot.fasta.phr
snakemake -p sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.fasta.phr
snakemake -p sprot_pf00014_human_id_filter_on_swissprot.fasta.phr

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset


Building a new DB, current time: 04/01/2019 17:19:51
New DB name:   /home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset/sprot_pf00014_non_human_id_filter_on_swissprot.fasta
New DB title:  sprot_pf00014_non_human_id_filter_on_swissprot.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 340 sequences in 0.00729704 seconds.


Building a new DB, current time: 04/01/2019 17:19:52
New DB name:   /home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset/sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.fasta
New DB title:  sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 500 sequences in 0.018121 seconds.


Building a new DB, current time: 04/01/2019 17:19:52
New DB name:   /home/alessandr

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	random_sampling
	1

rule random_sampling:
    input: sprot_non_pf00014_global_id.txt
    output: sprot_non_pf00014_global_id_sampled500.txt
    jobid: 0
    wildcards: list=sprot_non_pf00014_global_id, N=500

cat sprot_non_pf00014_global_id.txt | sort -R | sed -n '1,500p' > sprot_non_pf00014_global_id_sampled500.txt
Finished job 0.
1 of 1 steps (100%) done
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	filterfasta
	1

rule filterfasta:
    input: swissprot.fasta.gz, sprot_non_pf00014_global_id_sampled500.txt
    output: sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.fasta.gz
    jobid: 0
    wildcards: filter=sprot_non_pf00014_global_id_sampled500, db=swissprot

python ../src/fastafilter.py swissprot.fasta.gz sprot_non_pf00014_global_id_sampled500.txt | gzip > sprot_non_pf00014_global_id_sampled500_filter_on_swissprot.fasta.gz
Finished

In the previous step the databases were created and indexed. Two BLAST searches now have to be performed:
* human Kunitz vs non-human Kunitz: to observe positive examples
* human Kunitz vs random non-Kunitz: to observe negative examples

Take the E-values into account and set a threshold that optimizes the classification.
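
One way to turn the BLAST output into predictions is sketched below. It assumes the 12-column tabular format (`-m 8`), in which the subject identifier is the 2nd column and the E-value the 11th. Each database sequence gets the lowest E-value among its hits and is predicted as Kunitz if that value does not exceed the threshold. The function names, the demo hits and the threshold `1e-5` are placeholders, not values from this project.

```python
def parse_evalue(text):
    """Convert an E-value field to float.

    Defensive assumption: a value written without a leading digit
    (e.g. 'e-102') is read as '1e-102'.
    """
    return float("1" + text if text.startswith("e") else text)


def best_evalues(lines):
    """Map each subject id to the lowest E-value among its hits."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        subject, evalue = fields[1], parse_evalue(fields[10])
        if subject not in best or evalue < best[subject]:
            best[subject] = evalue
    return best


def classify(best, threshold):
    """Predict True (Kunitz) when the best E-value is at or below the threshold."""
    return {subject: evalue <= threshold for subject, evalue in best.items()}


# Made-up hits: query, subject, %id, length, mismatches, gaps,
# q.start, q.end, s.start, s.end, E-value, bit score.
demo_hits = [
    "Q1\tS1\t45.0\t50\t20\t1\t1\t50\t3\t52\t2e-30\t110\n",
    "Q2\tS1\t40.0\t50\t25\t1\t1\t50\t3\t52\t1e-25\t95\n",
    "Q1\tS2\t25.0\t30\t20\t2\t5\t34\t8\t37\t4.5\t15\n",
]
best = best_evalues(demo_hits)
calls = classify(best, threshold=1e-5)
print(calls)
```

Sequences that produce no hit at all do not appear in the output, so they would have to be counted as predicted negatives separately.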

In [6]:
%%bash
cd ../dataset/
pwd
snakemake -p sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_blast_run.gz
snakemake -p sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_non_pf00014_global_id_sampled500_filter_on_swissprot_blast_run.gz

/home/alessandro/Unibo/python-programming-alessandro-lussana/LB1/prj/dataset


Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	blast_run
	1

rule blast_run:
    input: sprot_pf00014_non_human_id_filter_on_swissprot.fasta, sprot_pf00014_human_id_filter_on_swissprot.fasta, sprot_pf00014_non_human_id_filter_on_swissprot.fasta.phr
    output: sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_blast_run.gz
    jobid: 0
    wildcards: query=sprot_pf00014_human_id_filter_on_swissprot, database=sprot_pf00014_non_human_id_filter_on_swissprot

blastpgp -i sprot_pf00014_human_id_filter_on_swissprot.fasta -d sprot_pf00014_non_human_id_filter_on_swissprot.fasta -e 1000 -m 8 | gzip > sprot_pf00014_human_id_filter_on_swissprot_vs_sprot_pf00014_non_human_id_filter_on_swissprot_blast_run.gz
Finished job 0.
1 of 1 steps (100%) done
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	blast_run
	1

rule blast_run:
    input: sprot_non_pf00014_global_id_sa

Choose the threshold that maximizes the classification performance, compute the confusion matrix, and calculate:
* accuracy
* [...] coefficient
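
A minimal sketch of the evaluation is given below, covering only the confusion matrix and the accuracy. It assumes that every positive and negative test sequence has been assigned its best E-value (for example `float("inf")` when there is no hit); the function names, the demo values and the candidate thresholds are made up.

```python
def confusion_matrix(pos_evalues, neg_evalues, threshold):
    """Return (TP, FP, TN, FN) when predicting positive for E-value <= threshold."""
    tp = sum(1 for e in pos_evalues if e <= threshold)
    fn = len(pos_evalues) - tp
    fp = sum(1 for e in neg_evalues if e <= threshold)
    tn = len(neg_evalues) - fp
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of correct predictions: (TP + TN) / total."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def best_threshold(pos_evalues, neg_evalues, candidates):
    """Pick the candidate threshold with the highest accuracy (first one on ties)."""
    return max(
        candidates,
        key=lambda t: accuracy(*confusion_matrix(pos_evalues, neg_evalues, t)),
    )


# Made-up best E-values for three positives and three negatives.
pos = [1e-30, 1e-12, 0.5]
neg = [2.0, 3.0, 40.0]
chosen = best_threshold(pos, neg, candidates=[1e-10, 1e-3, 1.0])
matrix = confusion_matrix(pos, neg, chosen)
print(chosen, matrix, accuracy(*matrix))
```

Choosing the threshold on the same sequences used to report the performance gives an optimistic estimate, so a separate set would be needed for an unbiased figure.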