# Chapter 19 – In Search of Differences in Proteomes
This project compares two serotypes of Escherichia coli: one pathogenic and one non-pathogenic. The serotype O157:H7 is a significant cause of foodborne illness and has been linked notably to undercooked meat since its detection in 1982. Phylogenetic analyses suggest that O157:H7 diverged from a common ancestor around 4.5 million years ago and acquired its pathogenicity possibly through horizontal gene transfer. Can we identify proteins associated with pathogenicity among those acquired genes?

To answer this question, we compare the translated, annotated genomes (the proteomes) of one non-pathogenic and one pathogenic serotype. The aim is to find genes that are present in one genome but absent from the other, closely related one. Central to this analysis is the Basic Local Alignment Search Tool (BLAST+), which we run locally in the terminal. For sequence download, I introduce the rather new tool NCBI Datasets.

This project requires NCBI Datasets and BLAST+ to be installed.

## Downloading Proteomes

In [None]:
for i in GCF_000005845.2 GCF_000008865.2; do ./datasets download genome accession $i --include protein --filename $i.zip; done

In [None]:
unzip GCF_000005845.2.zip

In [None]:
unzip GCF_000008865.2.zip
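
The `grep` step below expects files named `ec-k12.fasta` and `ec-h7.fasta`, which the download does not create by itself. The following is a minimal sketch of how they could be staged. It assumes that each archive unpacks its proteins to `ncbi_dataset/data/<accession>/protein.faa`, that GCF_000005845.2 is the K-12 assembly, and that GCF_000008865.2 is the O157:H7 assembly. The helper names `proteome_path` and `stage_proteome` are hypothetical; check the unpacked directory before relying on them.

```shell
# Assumed layout after unzip: ncbi_dataset/data/<accession>/protein.faa
proteome_path() {
    printf 'ncbi_dataset/data/%s/protein.faa\n' "$1"
}

# Copy one proteome to the short file name used in the rest of the chapter.
# Prints a note on stderr instead of failing when the source file is missing.
stage_proteome() {
    src=$(proteome_path "$1")
    if [ -f "$src" ]; then
        cp "$src" "$2"
    else
        echo "not found: $src (check the unzip step)" >&2
    fi
}

stage_proteome GCF_000005845.2 ec-k12.fasta   # assumed mapping: K-12
stage_proteome GCF_000008865.2 ec-h7.fasta    # assumed mapping: O157:H7
```

Copying rather than moving keeps the unpacked archive intact, so the step can be repeated if a file name turns out to be wrong.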

In [None]:
grep -c ">" ec*.fasta

## Creating BLAST DB

In [None]:
makeblastdb -in ec-k12.fasta -dbtype prot -title "Escherichia coli K12" -out ecolik12 -parse_seqids

In [None]:
ls -l ecolik12*

## BLASTing

In [None]:
time blastp -db ecolik12 -query ec-h7.fasta -out h7vsk12.txt -evalue .00001

In [None]:
ls -lh ec-* h7*

In [None]:
wc -l h7vsk12.txt

## Processing the BLAST Result File

In [None]:
awk '/Query=/ || /No hits/{print}' h7vsk12.txt | head -20

In [None]:
awk '/Query=/ || /No hits/{print $0}' h7vsk12.txt | awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}' | head
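
The two-stage pipeline above first keeps only the `Query=` and `No hits` lines and then prints the line preceding each `No hits` line. As a sketch of the same idea in a single pass, the following remembers the most recent `Query=` line and prints it whenever a `No hits` line follows. The report excerpt is invented and only imitates the layout of a BLAST+ text report; `mock-report.txt` is a hypothetical file name.

```shell
# Invented excerpt that imitates the layout of a BLAST+ text report.
cat > mock-report.txt <<'EOF'
Query= protA example protein A

Sequences producing significant alignments:

Query= protB example protein B

***** No hits found *****

Query= protC example protein C

***** No hits found *****
EOF

# Remember the latest Query= line; print it when a "No hits" line follows.
awk '/^Query=/{q=$0} /No hits/{print q}' mock-report.txt
```

On this excerpt the command lists protB and protC, the two queries without a hit. Both variants report only the first line of a query description, so a description that wraps onto a second line would appear truncated.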

In [None]:
awk '/Query=/ || /No hits/{print $0}' h7vsk12.txt | awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}' | wc -l

In [None]:
awk '/Query=/ || /No hits/{print $0}' h7vsk12.txt | awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}' | egrep -v "([Uu]nknown| [Pp]utative|[Hh]ypothetical|[Uu]ncharacterized)" | head -20

In [None]:
awk '/Query=/ || /No hits/{print $0}' h7vsk12.txt | awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}' | egrep -v "([Uu]nknown| [Pp]utative|[Hh]ypothetical|[Uu]ncharacterized)" | wc -l

## Playing with the E-Value

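```
#!/bin/bash
# save as autoblast.sh and make it executable: chmod +x autoblast.sh
# loops through three E-value thresholds, writing one report per threshold
for i in 1 0.001 0.00001
do
echo "Working on h7vsk12-$i.txt"
blastp -db ecolik12 -query ec-h7.fasta -out h7vsk12-$i.txt -evalue $i
done
```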

In [None]:
time ./autoblast.sh

In [None]:
ls h7vsk12*

In [None]:
wc -l h7vsk12*

In [None]:
for i in h7vsk12-*; do echo -n $i" : "; awk '/Query=/ || /No hits/{print $0}' $i | awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}' | wc -l; done

In [None]:
for i in h7vsk12-*; do echo -n $i" : "; awk '/Query=/ || /No hits/{print $0}' $i | awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}' | egrep -v "([Uu]nknown| [Pp]utative|[Hh]ypothetical|[Uu]ncharacterized)" | wc -l; done
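
The two loops above repeat the same pipeline for every report. One possible way to avoid the duplication is to wrap it in shell functions and print both counts side by side. This is only a sketch: the function names `nohit_queries` and `summarize` are hypothetical, and the demo report is invented so that the block runs without BLAST output.

```shell
# Queries without any hit: the two-stage awk pipeline used above, wrapped for reuse.
nohit_queries() {
    awk '/Query=/ || /No hits/{print $0}' "$1" |
        awk '{line[NR]=$0; if($0~/No hits/){print line[NR-1]}}'
}

# For each report print: file name, all no-hit queries, annotated no-hit queries.
# grep -E is the current spelling of egrep.
summarize() {
    for f in "$@"; do
        all=$(nohit_queries "$f" | wc -l)
        annotated=$(nohit_queries "$f" |
            grep -Ev "([Uu]nknown| [Pp]utative|[Hh]ypothetical|[Uu]ncharacterized)" | wc -l)
        printf '%s\t%d\t%d\n' "$f" "$all" "$annotated"
    done
}

# Demo on an invented two-query report.
demo=$(mktemp)
printf 'Query= p1 hypothetical protein\n***** No hits found *****\n' > "$demo"
printf 'Query= p2 toxin subunit\n***** No hits found *****\n' >> "$demo"
summarize "$demo"
rm -f "$demo"
```

On the real reports the call would be `summarize h7vsk12-*.txt`, which gives one tab-separated row per E-value threshold.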