# Quality Control: Using RSeQC to Verify Sample Quality

Specify the CRAM file, reference genome, region of interest, gene model, read length, and a nickname for the output files.

In [None]:
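# Python variables, so that later `!` commands can interpolate them
# (an assignment on a `!` line only lives in that line's own subshell).
cramfile = "HG01504.alt_bwamem_GRCh38DH.20150826.IBS.exome.cram"
fasta = "GRCh38_full_analysis_set_plus_decoy_hla.fa"
area = "chr11"
name = "HG1504_chr11"
genemodel = "hg38_RefSeq.bed"  # using hg38 gene model
readlength = 101  # depends on sample
bamfile = f"sorted.{name}.bam"  # sorted, indexed subset created in the next cell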

Subset the data to the region of interest using samtools.

In [None]:
# isolate area ({name} is used instead of $name because IPython would read $name.bam as an attribute lookup)
!samtools view -b -T {fasta} {cramfile} {area} > {name}.bam

# sort and index resulting bam file
!samtools sort -o sorted.{name}.bam {name}.bam
!samtools index sorted.{name}.bam

# calculate coverage of reads (and other stats)
!samtools stats sorted.{name}.bam > {name}.stats
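The stats file written above can be read back in Python. The following is a minimal sketch, assuming the usual `samtools stats` layout in which summary lines start with `SN` and are tab-separated; the function name `parse_summary_numbers` and the example lines are made up for illustration and are not results from this sample.

```python
def parse_summary_numbers(stats_text):
    """Collect the 'SN' (summary numbers) lines of samtools stats output into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and the other sections of the report
        fields = line.split("\t")
        key = fields[1].rstrip(":")
        value = fields[2]
        try:
            is_float = "." in value or "e" in value.lower()
            summary[key] = float(value) if is_float else int(value)
        except ValueError:
            summary[key] = value  # keep anything non-numeric as text
    return summary


# Made-up lines in the SN layout, for illustration only.
example = (
    "SN\traw total sequences:\t1000\n"
    "SN\treads mapped:\t950\n"
    "SN\terror rate:\t2.5e-03\t# mismatches / bases mapped (cigar)\n"
)
stats = parse_summary_numbers(example)
mapped_fraction = stats["reads mapped"] / stats["raw total sequences"]
print(f"mapped fraction: {mapped_fraction:.3f}")
```

For the real file, the text would come from something like `open(f"{name}.stats").read()`.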


Use the subsetted, sorted, and indexed BAM file as input for RSeQC.

## BAM File Summary

Create a mapping statistics summary. Look for a high proportion of mapped and uniquely mapped reads, and a low proportion of multimapped reads.

In [1]:
%%bash
# cd ~/work/data
# name="0531"
# bamfile="$name.bam"
cd ~/work/data/processed
name="bioliquid_run2"
bamfile="bioliquid_run2.bam"
# bamfile="$name.sort.bam"
# braces are required: bash would otherwise expand the (unset) variable $name_bam_stat
bam_stat.py -i "$bamfile" > "${name}_bam_stat.txt"

[W::hts_idx_load3] The index file is older than the data file: bioliquid_run2.bam.bai
Load BAM file ...  Done
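To turn the read counts in the summary into the proportions mentioned above, the report text can be parsed in Python. This is a rough sketch, assuming each line of the report is a label followed by an integer read count; the label strings and the numbers in `example_report` are assumptions for illustration and should be checked against the actual file.

```python
import re

# A label (not starting with '#'), an optional colon, whitespace, then a count.
LINE = re.compile(r"^(?P<label>[^#].*?):?\s+(?P<count>\d+)\s*$")


def parse_bam_stat(report_text):
    """Map each 'label  count' line of a bam_stat.py-style report to an integer."""
    counts = {}
    for line in report_text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group("label").strip()] = int(match.group("count"))
    return counts


def percent(part, total):
    """Percentage of `part` in `total`, or 0.0 when the total is zero."""
    return 100.0 * part / total if total else 0.0


# Made-up report text, for illustration only.
example_report = """\
#All numbers are READ count
Total records:                          1000
Non primary hits                        50
mapq < mapq_cut (non-unique):           100
mapq >= mapq_cut (unique):              850
"""
counts = parse_bam_stat(example_report)
unique_pct = percent(counts["mapq >= mapq_cut (unique)"], counts["Total records"])
print(f"unique reads: {unique_pct:.1f}%")
```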


View distribution of deletions across reads.

In [2]:
!pwd

/home/jovyan/work/code/bioliquid-nanopore


In [None]:
%%bash
# cd ~/work/data
# name="0531"
# bamfile="0531.sort.bam"
cd ~/work/data/processed
name="bioliquid_run2"
bamfile="bioliquid_run2.bam"
readlength=101

deletion_profile.py -i "$bamfile" -o "$name" -l "$readlength"
# render the plot from the generated R script
Rscript "$name.deletion_profile.r"

View distribution of insertions across reads. Note that deletion distribution shows counts while insertion distribution shows percentages. 

## Insertions and Deletions

In [None]:
!insertion_profile.py -i $bamfile -s "PE" -o $name
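The difference between a count-based and a percentage-based profile can be illustrated with CIGAR strings. The sketch below is not RSeQC's implementation; it is a simplified illustration in which `indel_positions` and `indel_profiles` are hypothetical helper names, deletions are reported as raw counts per read position, and insertions as the percentage of reads with an inserted base at that position.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_positions(cigar):
    """Return (insertion_positions, deletion_positions) as 0-based read offsets."""
    insertions, deletions = [], []
    read_pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            insertions.extend(range(read_pos, read_pos + length))
            read_pos += length
        elif op == "D":
            deletions.append(read_pos)  # the deletion sits just before this read base
        elif op in "MS=X":
            read_pos += length
        # N, H and P consume no read bases
    return insertions, deletions


def indel_profiles(cigars, read_length):
    """Deletion counts and insertion percentages for each read position."""
    insertion_counts, deletion_counts = Counter(), Counter()
    for cigar in cigars:
        ins, dels = indel_positions(cigar)
        insertion_counts.update(ins)
        deletion_counts.update(dels)
    n_reads = len(cigars)
    deletion_profile = [deletion_counts[i] for i in range(read_length)]
    insertion_profile = [
        100.0 * insertion_counts[i] / n_reads if n_reads else 0.0
        for i in range(read_length)
    ]
    return deletion_profile, insertion_profile


# Two made-up alignments of 5-base reads.
print(indel_profiles(["3M1D2M", "2M1I2M"], read_length=5))
```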

View GC content distribution across reads. Look for a normal distribution.

## GC Content

In [3]:
%%bash
# cd ~/work/data
# name="0531"
# bamfile="0531.sort.bam"
cd ~/work/data/processed
name="bioliquid_run2"
bamfile="bioliquid_run2.bam"
read_GC.py -i $bamfile -o $name

[W::hts_idx_load3] The index file is older than the data file: bioliquid_run2.bam.bai
Read BAM file ...  Done
writing GC content ...
writing R script ...
/opt/conda/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
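The GC content distribution can also be approximated directly in Python, independent of the R plotting step. This is a minimal sketch that works on a plain list of read sequences; `gc_percent`, `gc_histogram`, and the bin width of 10 are illustrative choices, not part of RSeQC.

```python
from collections import Counter


def gc_percent(sequence):
    """Percentage of G and C bases in one read (0.0 for an empty read)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(reads, bin_width=10):
    """Count reads per GC% bin; keys are the lower edge of each bin."""
    histogram = Counter()
    for read in reads:
        bin_start = int(gc_percent(read) // bin_width) * bin_width
        # fold a GC content of exactly 100% into the last bin
        histogram[min(bin_start, 100 - bin_width)] += 1
    return dict(sorted(histogram.items()))


# Made-up reads, for illustration only.
print(gc_histogram(["ATGC", "GGCC", "AAAA", "ATGCGC"]))
```

A roughly bell-shaped histogram here corresponds to the normal distribution one hopes to see in the RSeQC plot.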


View nucleotide distribution across reads. A slight 5' bias is expected; look for even percentages across A, C, T, and G bases and low numbers of unspecified or unknown bases.

In [None]:
!read_NVC.py -i $bamfile -o $name --nx
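The per-position nucleotide percentages behind this kind of plot can be sketched in a few lines of Python. This is an illustration of the idea rather than RSeQC's own code; `nucleotide_by_position` is a hypothetical helper that takes read sequences as plain strings.

```python
from collections import Counter


def nucleotide_by_position(reads, alphabet="ACGTN"):
    """For each read position, return the percentage of each base in `alphabet`."""
    longest = max((len(read) for read in reads), default=0)
    table = []
    for pos in range(longest):
        # only reads long enough to cover this position contribute
        counts = Counter(read[pos].upper() for read in reads if pos < len(read))
        total = sum(counts.values())
        table.append({base: 100.0 * counts[base] / total for base in alphabet})
    return table


# Made-up reads, for illustration only.
for pos, row in enumerate(nucleotide_by_position(["ACGT", "ACGN", "TCG"])):
    print(pos, row)
```

Bases outside `alphabet` still count toward the total at a position, so the reported percentages can sum to less than 100.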

View the read quality (Phred score) distribution. The Phred score is Q = -10 * log10(P), where P is the probability of a base-calling error; for example, Q = 20 corresponds to a 1% error probability.

## Read Quality

In [22]:
%%bash
cd ~/work/data
name="0531"
bamfile="0531.sort.bam"
read_quality.py -i $bamfile -o $name

Read BAM file ...  Done
/opt/conda/lib/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory
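The relationship between Phred scores and error probabilities is easy to check numerically. The sketch below assumes the Phred+33 ASCII encoding that SAM/BAM uses for quality strings; the function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score Q for an error probability P (P must be greater than 0)."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Convert a Phred+33 ASCII quality string into a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]


# Made-up quality string, for illustration only.
for q in decode_phred33("!5I"):
    print(q, phred_to_error_probability(q))
```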


View read coverage along gene bodies, from the 5' end to the 3' end, using the gene model. A slight 5' bias is expected; look for a smooth profile without sharp drops toward either end.

In [None]:
!geneBody_coverage.py -i $bamfile -r $genemodel -o $name 
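Gene body coverage plots rescale every transcript to a common length so that genes of different sizes can be compared. The following is a simplified sketch of that rescaling step, not RSeQC's implementation; `gene_body_profile` is a hypothetical helper that averages per-base coverage (ordered 5' to 3') into a fixed number of bins.

```python
def gene_body_profile(coverage, bins=100):
    """Average per-base coverage (5' to 3') into `bins` equal-width segments."""
    n = len(coverage)
    if n == 0:
        return []
    profile = []
    for b in range(bins):
        start = b * n // bins
        # guarantee at least one base per bin when the gene is shorter than `bins`
        end = max((b + 1) * n // bins, start + 1)
        chunk = coverage[start:end]
        profile.append(sum(chunk) / len(chunk))
    return profile


# Made-up coverage values, for illustration only.
print(gene_body_profile([5, 7, 9, 9, 8, 8, 6, 2], bins=4))
```

Plotting the binned values for many transcripts on a shared 0 to 100 axis gives the kind of 5' to 3' curve that the RSeQC plot shows.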