# Constants

In [12]:
FASTQC_PATH = "/home/analytics/distr/FastQC/fastqc"
TRIMMOMATIC_JAR = "/home/analytics/bin/trimmomatic.jar"

# Data fetching

In [1]:
!mkdir -p ../data/
!wget -P ../data/ ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/all_assembly_versions/GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.fna.gz 2> /dev/null
!wget -P ../data/ ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/all_assembly_versions/GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.gff.gz 2> /dev/null

In [2]:
!wget -P ../data/ http://public.dobzhanskycenter.ru/mrayko/amp_res_1.fastq.zip 2> /dev/null
!wget -P ../data/ http://public.dobzhanskycenter.ru/mrayko/amp_res_2.fastq.zip 2> /dev/null

In [3]:
!unzip -o ../data/amp_res_1.fastq.zip -d ../data/
!unzip -o ../data/amp_res_2.fastq.zip -d ../data/

Archive:  ../data/amp_res_1.fastq.zip
  inflating: ../data/amp_res_1.fastq  
Archive:  ../data/amp_res_2.fastq.zip
  inflating: ../data/home/mike/UCSD/BIMM185/Week1/amp_res_2.fastq  
  inflating: ../data/amp_res_2.fastq  


# FastQC analysis

In [5]:
!$FASTQC_PATH ../data/amp_res_1.fastq ../data/amp_res_2.fastq 2> /dev/null

Analysis complete for amp_res_1.fastq
Analysis complete for amp_res_2.fastq


In [6]:
from boto.s3 import connect_to_region
from boto.s3.key import Key

In [7]:
AWS_S3_ROOT = "http://bioinf-workshop.s3-website.eu-central-1.amazonaws.com/"

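def perform_file_upload(file_, file_name):
    """Upload an HTML string to the bioinf-workshop bucket and return its public URL."""
    # Credentials are read from local files so they never appear in the notebook
    with open("../.access_key", "r") as f:
        access_key = f.read().strip()
    with open("../.secret_key", "r") as f:
        secret_key = f.read().strip()

    conn = connect_to_region("eu-central-1",
                             aws_access_key_id=access_key,
                             aws_secret_access_key=secret_key)

    bucket = conn.get_bucket("bioinf-workshop")
    key = Key(bucket)
    key.key = file_name
    key.set_contents_from_string(file_, headers={"Content-Type": "text/html"})
    key.set_acl('public-read')
    return AWS_S3_ROOT + file_name  # public URL to access the uploaded file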


In [8]:
f_1 = perform_file_upload(open("../data/amp_res_1_fastqc.html", "r").read(), "week_1/fastqc_1.html")
f_2 = perform_file_upload(open("../data/amp_res_2_fastqc.html", "r").read(), "week_1/fastqc_2.html")

In [9]:
from IPython.display import IFrame

IFrame(src=f_1, width=1000, height=600)

In [10]:
from IPython.display import IFrame

IFrame(src=f_2, width=1000, height=600)

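#### The quality strings contain characters such as '#' and ':'. Their ASCII codes are below 59, the lowest code used by any +64 encoding, so this FASTQ file is almost certainly on the Phred+33 scale.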

In [24]:
ord(":"), ord("#")

(58, 35)
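
The same check can be automated over whole quality strings. The cell below is a minimal sketch, not part of the original analysis: the function name `guess_phred_offset` is hypothetical, and the cutoff of 59 assumes the usual convention that +64 encodings never go below ASCII 59 (`;`).

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from FASTQ quality strings."""
    # Lowest ASCII code seen in any quality string
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    # +64 encodings do not use characters below ASCII 59 (';'),
    # so anything lower can only come from a Phred+33 file.
    if lowest < 59:
        return 33
    # Not conclusive: a Phred+33 file with only high scores looks the same.
    return 64
```

For this data set, a single quality line containing `#` or `:` is enough for the function to return 33. A result of 64 should be treated as a guess only.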

The raw reads have several problems:  
* Low quality at the ends of the reads
* A systematic error in the forward reads (possibly a bubble or a similar tile problem)
* A severely non-uniform nucleotide distribution at the start of the reads
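
The first problem can also be checked without FastQC by averaging the quality score at each read position. The sketch below assumes a plain four-line-per-record FASTQ file and the Phred+33 offset; `mean_quality_by_position` is a hypothetical helper, not something used elsewhere in this notebook.

```python
from collections import defaultdict


def mean_quality_by_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position.

    Assumes every record spans exactly four lines, with the
    quality string on the fourth line.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # skip header, sequence and '+' lines
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]

# Possible usage on the raw forward reads:
# with open("../data/amp_res_1.fastq") as fh:
#     means = mean_quality_by_position(fh)
```

A falling tail in the returned list corresponds to the quality drop at the read ends that FastQC shows in its per-base quality plot.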

We can address these problems with Trimmomatic. After several tries I ended up with the following configuration:
* Do not clip adapters: the adapters are probably non-standard
* Crop the first 20 nucleotides of each read instead (`HEADCROP:20`)
* Use a sliding window (`SLIDINGWINDOW:4:30`) to cut off low-quality read ends and tile problems
* Require a minimum read length of 30 (`MINLEN:30`) as a rule of thumb
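
To make the effect of these three steps concrete, here is a simplified Python sketch of the logic applied to one read's quality scores. It is an illustration only: `trim_read` is a hypothetical function, and Trimmomatic's own sliding-window step may place the cut slightly differently from this version, which cuts at the start of the first failing window.

```python
def trim_read(quals, headcrop=20, window=4, min_avg=30, minlen=30):
    """Return the (start, end) slice to keep, or None if the read is dropped.

    quals is a list of integer Phred scores for one read.
    """
    # HEADCROP: discard the first `headcrop` bases
    start = min(headcrop, len(quals))
    end = len(quals)
    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose mean quality falls below `min_avg`
    for i in range(start, len(quals) - window + 1):
        if sum(quals[i:i + window]) < min_avg * window:
            end = i
            break
    # MINLEN: drop reads that end up too short
    if end - start < minlen:
        return None
    return start, end
```

With these defaults a 60-base read of uniformly high quality keeps positions 20 to 59, while a read whose quality collapses soon after the cropped region is dropped entirely.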

In [79]:
!java -jar $TRIMMOMATIC_JAR PE \
    -phred33 \
    -trimlog ../data/trim.log \
    ../data/amp_res_1.fastq ../data/amp_res_2.fastq \
    ../data/amp_res_1_p.fastq ../data/amp_res_1_u.fastq ../data/amp_res_2_p.fastq ../data/amp_res_2_u.fastq \
    HEADCROP:20 SLIDINGWINDOW:4:30 MINLEN:30

TrimmomaticPE: Started with arguments: -phred33 -trimlog ../data/trim.log ../data/amp_res_1.fastq ../data/amp_res_2.fastq ../data/amp_res_1_p.fastq ../data/amp_res_1_u.fastq ../data/amp_res_2_p.fastq ../data/amp_res_2_u.fastq HEADCROP:20 SLIDINGWINDOW:4:30 MINLEN:30
Multiple cores found: Using 16 threads
Input Read Pairs: 455876 Both Surviving: 271424 (59.54%) Forward Only Surviving: 53015 (11.63%) Reverse Only Surviving: 43821 (9.61%) Dropped: 87616 (19.22%)
TrimmomaticPE: Completed successfully
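
The summary line above can be parsed to check the counts. This is a small sketch: `parse_trimmomatic_summary` is a hypothetical helper, and the regular expression assumes the exact wording of the summary line printed in this run.

```python
import re

SUMMARY = ("Input Read Pairs: 455876 Both Surviving: 271424 (59.54%) "
           "Forward Only Surviving: 53015 (11.63%) "
           "Reverse Only Surviving: 43821 (9.61%) Dropped: 87616 (19.22%)")


def parse_trimmomatic_summary(line):
    """Extract the five read-pair counts from a paired-end summary line."""
    pattern = (r"Input Read Pairs: (\d+) Both Surviving: (\d+) .*?"
               r"Forward Only Surviving: (\d+) .*?"
               r"Reverse Only Surviving: (\d+) .*?Dropped: (\d+)")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a Trimmomatic PE summary line")
    keys = ("input", "both", "forward_only", "reverse_only", "dropped")
    return dict(zip(keys, map(int, match.groups())))


counts = parse_trimmomatic_summary(SUMMARY)
# Pairs that lost at least one mate to trimming
lost = counts["input"] - counts["both"]
```

The four outcome counts add up to the 455876 input pairs, and 184452 pairs (about 40%) lost at least one mate, which is the price of the strict `SLIDINGWINDOW:4:30` threshold.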


In [80]:
!$FASTQC_PATH ../data/amp_res_1_p.fastq ../data/amp_res_2_p.fastq 2> /dev/null

Analysis complete for amp_res_1_p.fastq
Analysis complete for amp_res_2_p.fastq


In [81]:
f_1 = perform_file_upload(open("../data/amp_res_1_p_fastqc.html", "r").read(), "week_1/fastqc_p_1.html")
f_2 = perform_file_upload(open("../data/amp_res_2_p_fastqc.html", "r").read(), "week_1/fastqc_p_2.html")

In [82]:
from IPython.display import IFrame

IFrame(src=f_1, width=1000, height=600)

In [83]:
from IPython.display import IFrame

IFrame(src=f_2, width=1000, height=600)

The reads now look suitable for further analysis.