# Chapter 18 – BLASTing Forensic PCR Primers
In this project, we locate CODIS primer sequences in the human genome. CODIS, the Combined DNA Index System, is a United States program that supports law enforcement by identifying and linking DNA evidence. Forensic laboratories contribute and compare DNA profiles electronically at three levels: the Local DNA Index System (LDIS), the State DNA Index System (SDIS), and the National DNA Index System (NDIS). We will run BLAST+ searches with CODIS primer sequences against several genomic sequences, including the human reference genome, a Caucasian individual sequenced for the Ashkenazi Human Reference Genome project, and a female Sumatran orangutan. From the resulting virtual PCR amplicons, we will build a multiple sequence alignment and generate a guide tree, which is computed from the distance matrix of pairwise alignment scores. We will use Clustal Omega for the alignment and NJplot to visualize the resulting tree.

## Installations 

- `sudo apt-get install -y ncbi-entrez-direct`
- `sudo apt install -y ncbi-blast+`

## Preparations

- `mkdir CompBiol2023`
- `cd CompBiol2023`

### Download human chromosome 3

- `efetch -db nuccore -id NC_000003 -format fasta > hs-chr3.fasta`

### Download D3S1358 primer pair

- Go to: https://strbase-archive.nist.gov/str_D3S1358.htm
- Create a FASTA file called *D3S1358.primers* with the following content:

```
>f-primer
ACTGCAGTCCAATCTGGGT
>r-primer
ATGAAATCAACAGAGGCTTG
```
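The file can also be written directly from the shell. The following is a minimal sketch that uses a here-document and assumes the current directory is `CompBiol2023`; the final `grep -c` line is only a sanity check that counts the FASTA header lines.

```shell
# Write the two primers into D3S1358.primers (overwrites an existing file of that name).
cat > D3S1358.primers <<'EOF'
>f-primer
ACTGCAGTCCAATCTGGGT
>r-primer
ATGAAATCAACAGAGGCTTG
EOF

# Sanity check: count the header lines (one per primer).
grep -c '^>' D3S1358.primers
```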

## Look for primers in chromosome

- `grep -E --color 'ACTGCAGTCCAATCTGGGT' hs-chr3.fasta`
- `grep -E --color 'ATGAAATCAACAGAGGCTTG' hs-chr3.fasta`
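A plain text search like the two above only finds a primer that lies entirely within one line of the FASTA file and on the strand stored in that file. The sketch below defines two hypothetical helpers (the names `revcomp` and `unwrap_fasta` are not part of any toolkit) and assumes that `rev` and `tr` are installed: `revcomp` prints the reverse complement of a primer, and `unwrap_fasta` removes header lines and line breaks so that a match cannot be split across lines.

```shell
# Hypothetical helper: reverse-complement the DNA string given as first argument.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Hypothetical helper: drop FASTA header lines and join the sequence into one line.
unwrap_fasta() {
  grep -v '^>' "$1" | tr -d '\n'
  echo
}

# Reverse complement of the forward primer.
revcomp ACTGCAGTCCAATCTGGGT
```

With these helpers, `unwrap_fasta hs-chr3.fasta | grep -o -E "ACTGCAGTCCAATCTGGGT|$(revcomp ACTGCAGTCCAATCTGGGT)"` would search both orientations of the forward primer. Note that the whole chromosome is then processed as a single very long line, which costs memory.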

## Setup BLAST database

- `makeblastdb -in hs-chr3.fasta -dbtype nucl -out hsc3`

## BLASTing

- `time blastn -query D3S1358.primers -db hsc3 -out d3s1358-vs-chr3.txt`
- `less d3s1358-vs-chr3.txt`
    - inspect the output file with `less`
    - there are no hits, because the default task (`megablast`, word size 28) cannot seed a match with primers of only 19 and 20 nucleotides

- `blastn -query D3S1358.primers -db hsc3 -out d3s1358-vs-chr3.txt -task blastn-short`
    - switch to the `blastn-short` task, which is optimized for queries shorter than 50 bases
    - inspect the output file again

- `blastn -query D3S1358.primers -db hsc3 -out d3s1358-vs-chr3.tab -task blastn-short -outfmt 6`
    - switch the BLAST output to tabular format (`-outfmt 6`)
    - inspect the output file

- `awk '$3==100 && $4>=19{print $1, $9, $10}' d3s1358-vs-chr3.tab`
    - print only perfect matches (100% identity over at least 19 bases): query ID, subject start, subject end
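The default `-outfmt 6` table has twelve columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. The `awk` filter therefore tests percent identity (`$3`) and alignment length (`$4`) and prints the query ID with the subject start and end (`$9`, `$10`). The sketch below applies the same filter to three invented rows; the coordinates and scores are made up for illustration and are not real BLAST results, and the helper names `sample_hits` and `good_hits` are hypothetical.

```shell
# Invented example rows in the default 12-column -outfmt 6 layout
# (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore).
sample_hits() {
  printf 'f-primer\tchr\t100.000\t19\t0\t0\t1\t19\t5130\t5112\t0.002\t38.2\n'
  printf 'r-primer\tchr\t100.000\t20\t0\t0\t1\t20\t5000\t5019\t0.001\t40.1\n'
  printf 'r-primer\tchr\t100.000\t13\t0\t0\t3\t15\t9000\t9012\t8.1\t26.3\n'
}

# Same filter as in the text, with named variables for the tested columns.
good_hits() {
  awk '{ pident = $3 + 0; len = $4 + 0 }
       pident == 100 && len >= 19 { print $1, $9, $10 }'
}

# Only the two full-length hits pass; the 13-base partial hit is dropped.
sample_hits | good_hits
```

A subject start larger than the subject end, as in the first invented row, indicates a hit on the reverse strand.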

## Retrieve amplicon sequence

- `blastdbcmd -db hsc3 -entry all -range 45540713-45540843`
- `blastdbcmd -db hsc3 -entry all -range 45540713-45540843 | sed '1d' | sed 's/TCTA/ & /g' | grep -E --color '(ATGAAATCAACAGAGGCTTGC|ACCCAGATTGGACTGCAGT)'`
    - set off each TCTA repeat unit with spaces and highlight the primer binding sites

## TASK
Download the chromosome 3 sequences of other individuals and of the orangutan (*Pongo*):

- `efetch -db nuccore -id CM000553 -format fasta > hs-chr3-pongo.fasta`
- `efetch -db nuccore -id AP023463 -format fasta > hs-chr3-japanese.fasta`
- `efetch -db nuccore -id CM021570 -format fasta > hs-chr3-ashkenazi.fasta`
- `efetch -db nuccore -id CH003498 -format fasta > hs-chr3-venter.fasta`
- `efetch -db nuccore -id NC_000003 -format fasta > hs-chr3.fasta`

Repeat the above procedure and describe your observations.

---

# Automating Primer Download

## Installation

- `sudo apt install pandoc`

## Process the website

- `pandoc -f html -t plain https://strbase-archive.nist.gov/PP16primers.htm`
- `pandoc -f html -t plain https://strbase-archive.nist.gov/PP16primers.htm | grep -E '(Pair |[ATGC]{10,})'`
- `pandoc -f html -t plain https://strbase-archive.nist.gov/PP16primers.htm | grep -E '(Pair |[ATGC]{10,})' | awk '{if (NR%3 == 1) {id=$1} else if (NR%3 == 2) {print ">f_"id; print $0} else if (NR%3 == 0) {print ">r_"id; print $0}}'`
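The `awk` step relies on the filtered text arriving in groups of three lines: a header line containing `Pair `, then the forward primer, then the reverse primer. The sketch below shows that logic on an invented stand-in for the filtered page text; the header wording is an assumption, and the helper names `sample_lines` and `to_fasta` are hypothetical.

```shell
# Invented stand-in for the filtered pandoc output: a header line containing
# "Pair ", followed by the forward and then the reverse primer.
sample_lines() {
  printf 'D3S1358 Primer Pair (invented header text)\n'
  printf 'ACTGCAGTCCAATCTGGGT\n'
  printf 'ATGAAATCAACAGAGGCTTG\n'
}

# Line 1 of each triple supplies the ID, line 2 the forward primer,
# line 3 the reverse primer.
to_fasta() {
  awk 'NR % 3 == 1 { id = $1 }
       NR % 3 == 2 { print ">f_" id; print $0 }
       NR % 3 == 0 { print ">r_" id; print $0 }'
}

sample_lines | to_fasta
```

If the page layout deviates from this three-line pattern, for example through an extra matching line, every following record is shifted, so the generated FASTA should be checked by eye.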

## Download all primer pairs into individual FASTA files

- `pandoc -f html -t plain https://strbase-archive.nist.gov/PP16primers.htm | grep -E '(Pair |[ATGC]{10,})' | awk '{if (NR % 3 == 1) {id=$1} else if (NR % 3 == 2) {print ">f_"id > id".fasta"; print $0 >> id".fasta"} else if (NR % 3 == 0) {print ">r_"id >> id".fasta"; print $0 >> id".fasta"}}'`

---

# Automation of the Process for Marker D21S11

## Download chromosomes
- `efetch -db nuccore -id CM000571 -format fasta > chr21-pongo.fasta`
- `efetch -db nuccore -id AP023481 -format fasta > chr21-japanese.fasta`
- `efetch -db nuccore -id CM021588 -format fasta > chr21-ashkenazi.fasta`
- `efetch -db nuccore -id CH003516 -format fasta > chr21-venter.fasta`
- `efetch -db nuccore -id NC_000021 -format fasta > chr21-ref.fasta`

## Create BLAST databases

- `for i in chr21*fasta; do base=$(basename "$i" .fasta); makeblastdb -in "$i" -dbtype nucl -out "$base"; done`

## BLAST primer and create tabular output

- `for i in chr21*fasta; do base=$(basename "$i" .fasta); blastn -query D21S11.fasta -db "$base" -out "$base.tab" -task blastn-short -outfmt 6; done`

## Create function to extract primer pairs

- `primerpair() { awk '{if($1~/f_/ && $3==100){forward[NR]=$9}else if($1~/r_/ && $3==100){reverse[NR]=$9}}END{for(x in forward){for(y in reverse){z=forward[x]-reverse[y]; if(z<0){z=z*-1}; if(z<400){gsub(/\.tab/,"",FILENAME); print FILENAME, reverse[y],forward[x]}}}}' $1; }`
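The one-liner is dense, so here is a spelled-out sketch of the same logic under a hypothetical name, `primerpair_readable`. It stores the subject start (`$9`) of every 100% identity hit for forward (`f_`) and reverse (`r_`) primers, then prints each forward/reverse combination whose start positions lie less than 400 bases apart, prefixed with the file name minus `.tab`. The demo rows are invented and only illustrate the input and output format.

```shell
# Readable sketch of the primerpair logic; prints: <name> <reverse start> <forward start>
primerpair_readable() {
  awk '
    $3 == 100 && $1 ~ /f_/ { forward[NR] = $9; next }
    $3 == 100 && $1 ~ /r_/ { reverse[NR] = $9 }
    END {
      name = FILENAME
      sub(/\.tab$/, "", name)
      for (x in forward)
        for (y in reverse) {
          d = forward[x] - reverse[y]
          if (d < 0) d = -d
          if (d < 400) print name, reverse[y], forward[x]
        }
    }' "$1"
}

# Invented demo table: one forward hit, one nearby reverse hit, one distant reverse hit.
demo_dir=$(mktemp -d)
printf 'f_D21S11\tchr\t100.000\t20\t0\t0\t1\t20\t1000\t1019\t0.001\t40.1\n'  > "$demo_dir/demo.tab"
printf 'r_D21S11\tchr\t100.000\t20\t0\t0\t1\t20\t1230\t1211\t0.001\t40.1\n' >> "$demo_dir/demo.tab"
printf 'r_D21S11\tchr\t100.000\t12\t0\t0\t1\t12\t90000\t90011\t9.5\t24.3\n' >> "$demo_dir/demo.tab"

# Only the nearby pair is reported; the hit at 90000 is more than 400 bases away.
(cd "$demo_dir" && primerpair_readable demo.tab)
```

Like the original, this sketch filters on identity only, not on alignment length, so short perfect partial hits close to a true primer site would also be paired.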

## Use function to extract primer matches

- `for i in chr21*tab; do primerpair $i; done`

## Retrieve amplicon sequences

- `for i in chr21*tab; do base=$(basename $i .tab); echo $i; primerpair $i | awk '{cmd="blastdbcmd -db "$1" -entry all -range "$3"-"$2" | sed '1d'"; system(cmd)}'; done`

- `for i in chr21*tab; do base=$(basename $i .tab); echo $i; primerpair $i | awk '{cmd="blastdbcmd -db "$1" -entry all -range "$3"-"$2" | sed '1d'"; system(cmd)}' | grep --color TATC; done`
    - show only the sequence lines that contain the TATC repeat and highlight each repeat unit

- `for i in chr21*tab; do base=$(basename $i .tab); echo $i; primerpair $i | awk '{cmd="blastdbcmd -db "$1" -entry all -range "$3"-"$2" | sed '1d'"; system(cmd)}' | grep -o TATC | wc -l; done`
    - count the TATC repeat units in each amplicon
