# Data Release Update - v200821
```
Lead     : ababaian
Issue    : n/a
start    : 2020 08 17
complete : 2020 08 21
files    : ~/serratus/notebook/200817/
s3 files : s3://serratus-public/notebook/200817/
```

## Introduction

As of the most recent data release (3.84 million summary files), the `.bam` files are neither sorted nor indexed. This was done to mitigate disk over-run in the `merge` module of the Serratus pipeline, but it makes the data less accessible.

The `summarizer` has also been updated for better tracking of 'top hits' and now uses log2-based bin-scoring instead of the current relative score.

To update the current data release, all `.bam` files need to be downloaded, sorted, and indexed (`.bai` created); in the same pass, the updated `summarizer` is run to give better summary data.


### Objectives

- Sort and index all data-release bam files
- Re-summarize all data-release bam files


## Materials and Methods


### System Initialization


In [1]:
# EC2 C5.xlarge instance fired up
date

Mon Aug 17 11:34:19 PDT 2020


In [None]:
# Build containers from github
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

export DOCKERHUB_USER='serratusbio' # optional
sudo docker login # optional

git clone https://github.com/ababaian/serratus.git; cd serratus/containers
./build_containers.sh


In [None]:
screen

sudo docker run --rm --entrypoint /bin/bash \
-it serratusbio/serratus-merge:latest

### Re-summarizer script


In [None]:
#!/bin/bash
# run_resummarize
#
# Base: serratus-merge 

set -eu
PIPE_VERSION="0.3.0"

function usage {
  echo ""
  echo "Usage: run_resummarize.sh -s <SRR accession> [OPTIONS]"
  echo ""
  echo "    -h    Show this help/usage message"
  echo ""
  echo "    Required Parameters"
  echo "    -s    SRA Accession"
  echo "    -g    Genome identifier [cov3ma] (used for sumzer)"
  echo "    -L    S3 bucket path [s3://lovelywater2]"
  echo ""
  echo "    Merge Parameters"
  echo "    -n    parallel CPU threads to use where applicable  [1]"
  echo ""
  #echo "    Optional outputs"
  #echo "    -i    Flag. Generate bam.bai index file. Requires sort, otherwise false."
  #echo "    -f    Flag. Generate flagstat summary file"
  #echo "    -r    Flag. Sort final bam output (requires double disk usage)"
  #echo ""
  #echo ""
  echo "    Output options"
  echo "    -o    <output_filename_prefix> [Defaults to SRA_ACCESSION]"
  echo "    -O    Output S3 bucket [s3://serratus-bio]"
  echo ""
  echo "    Uploaded to s3 (bam is sorted): "
  echo "          <output_prefix>.bam, <output_prefix>.bam.bai, <SRA accession>.summary"
  echo ""
  echo "ex: bash run_resummarize.sh -s 'SRR123'"
  exit 1
}


# PARSE INPUT =============================================
# Generate random alpha-numeric for run-id
#RUNID=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 8 | head -n 1 )

# Run Parameters
SRA=''
GENOME='cov3ma'
S3='s3://lovelywater2'

# Merge Options
THREADS='1'
#INDEX='negative'
#FLAGSTAT='negative'
#SORT='negative'

# Script Arguments -M
MERGE_ARGS=''

# Output options -do
BASEDIR="/home/serratus"
OUTNAME=''
S3_OUT='s3://serratus-bio'


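# Options followed by ':' take an argument (-n needs one for the thread count)
while getopts s:g:L:o:O:n:ifrh FLAG; do
  case $FLAG in
    s)
      SRA=$OPTARG
      ;;
    g)
      GENOME=$OPTARG
      ;;
    L)
      S3=$OPTARG
      ;;
    o)
      OUTNAME=$OPTARG
      ;;
    O)
      S3_OUT=$OPTARG
      ;;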
    # Merge Options ---------
    n)
      THREADS=$OPTARG
      ;;
    i)
      INDEX="true"
      ;;
    f)
      FLAGSTAT="true"
      ;;
    r)
      SORT="true"
      ;;
    h)  #show help ----------
      usage
      ;;
    \?) #unrecognized option - show help
      echo "Input parameter not recognized"
      usage
      ;;
  esac
done
shift $((OPTIND-1))

# Check inputs --------------
# Required parameters
if [ -z "$SRA" ]
then
  echo "-s <SRA / Outname> required"
  usage
fi

if [ -z "$OUTNAME" ]
then
  OUTNAME="$SRA"
fi

# Final Output Bam File name
OUTBAM="$OUTNAME.bam"


# SCRIPT ===================================================
# Command to run summarizer script
sumzer="$GENOME.sumzer.tsv"

if [ ! -e "$GENOME.sumzer.tsv" ]; then
        echo "  $GENOME.sumzer.tsv not found. Attempting download from"
        echo "  $S3/seq/$GENOME/$GENOME.sumzer.tsv"

        aws s3 cp $S3/seq/$GENOME/$GENOME.sumzer.tsv ./
fi


# Meta-data header for summary file
export SUMZER_COMMENT=$(echo sra="$SRA",genome="$GENOME",version=200818,date=$(date +%y%m%d-%R))

# Summary Comment / Meta-data
# usage: serratus_summarizer_flom.py InputSamFileName MetaTsvFilename SummaryFileName OutputSamFileName
#summarizer="python3 /home/serratus/serratus_summarizer.py /dev/stdin $sumzer $SRA.summary /dev/stdout"
summarizer="python3 /home/serratus/serratus_summarizer.py /dev/stdin $sumzer $SRA.summary /dev/null"

# Acquire + Run -----------------------
# Download bam file
aws s3 cp $S3/bam/$SRA.bam ./$SRA.unsorted.bam

# Summarize v2
samtools view $SRA.unsorted.bam | \
$summarizer 

# Sort
samtools sort -@ $THREADS $SRA.unsorted.bam >\
$OUTBAM

# index
samtools index $OUTBAM

# Upload ------------------------------
# Use $OUTBAM so the upload still works when -o differs from the accession
if [[ -s "$OUTBAM" ]]; then
  aws s3 cp --only-show-errors "$OUTBAM" $S3_OUT/bam/
fi

if [[ -s "$OUTBAM.bai" ]]; then
  aws s3 cp --only-show-errors "$OUTBAM.bai" $S3_OUT/bam/
fi

if [[ -s "$SRA.summary" ]]; then
  aws s3 cp --only-show-errors $SRA.summary $S3_OUT/summary/
fi

# Clean-up
rm $SRA.unsorted.bam $OUTBAM $OUTBAM.bai $SRA.summary

# end of script
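
The option parsing in the script depends on the `getopts` option string: a letter followed by `:` takes a value, a bare letter is a flag, and parsing stops at the first word that is not an option. A minimal sketch with a reduced option set; `parse_demo` is a hypothetical helper for illustration and is not part of `run_resummarize.sh`.

```shell
#!/bin/bash
# Hypothetical helper (not in run_resummarize.sh): parses a reduced
# option set to show how the getopts option string controls arguments.
parse_demo() {
  local OPTIND=1 FLAG      # reset so the function can be called repeatedly
  SRA=''; THREADS='1'; SORT='false'
  # "s:" and "n:" take a value; "r" is a bare flag
  while getopts "s:n:r" FLAG; do
    case $FLAG in
      s) SRA=$OPTARG ;;
      n) THREADS=$OPTARG ;;
      r) SORT='true' ;;
      *) return 1 ;;
    esac
  done
}

parse_demo -n 4 -s SRR123 -r
echo "sra=$SRA threads=$THREADS sort=$SORT"
```

If `n` were declared without the colon, `-n 4` would leave `4` as a positional argument and the `-s` that follows would never be read.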

### Summarizer wrapper


In [None]:
# Within serratus-merge container
yum install -y bzip2 tar make

# Install GNU parallel
(wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
     fetch -o - http://pi.dk/3 ) > install.sh
bash install.sh; rm install.sh

In [None]:
# Download + parse bam index
aws s3 cp s3://lovelywater2/index.tsv ./

sed 's/...............................//g' index.tsv \
  | sed 's/.summary//g' - \
  > sra.list

# bash-4.2# wc -l sra.list 
# 3837755 sra.list
chmod 755 run_resummarize.sh

yes ./run_resummarize.sh -n 4 -s \
  | head -n 3837755 \
  | paste - sra.list \
  > resummarize.cmd
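
The command file is one `run_resummarize.sh` call per accession. As a sketch of an alternative that does not need the line count up front, `sed` can prefix each line of the list directly. The accessions and the temporary directory below are made up for illustration.

```shell
#!/bin/bash
set -eu
workdir=$(mktemp -d)

# Toy stand-in for sra.list (accessions are illustrative only)
printf '%s\n' SRR0000001 SRR0000002 SRR0000003 > "$workdir/sra.list"

# Prefix every accession with the command; no line count needed
sed 's|^|./run_resummarize.sh -n 4 -s |' "$workdir/sra.list" \
  > "$workdir/resummarize.cmd"

cat "$workdir/resummarize.cmd"
```

The `yes | head | paste` form joins the command and accession with a tab, while this form uses a single space; the shell treats both as argument separators.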

In [None]:
# preload sumzer
aws s3 cp s3://lovelywater2/seq/cov3ma/cov3ma.sumzer.tsv ./
chmod 755 run_resummarize.sh

cat resummarize.cmd | parallel -j20
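
`parallel -j20` reads one command per line from stdin and keeps up to 20 running at once. The pattern can be tried on a toy command file; the sketch below swaps in `xargs -P` purely so it runs where GNU parallel is not installed, and the `echo` jobs are placeholders for the real commands.

```shell
#!/bin/bash
set -eu
workdir=$(mktemp -d)

# Toy command file: one shell command per line, as in resummarize.cmd
printf 'echo job-%s\n' 1 2 3 4 > "$workdir/demo.cmd"

# Run up to 2 lines at a time; each line is handed to `sh -c` as one command
xargs -P 2 -I{} sh -c '{}' < "$workdir/demo.cmd" > "$workdir/demo.out"

# Completion order is not guaranteed, so sort before inspecting
sort "$workdir/demo.out"
```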

## Iteration 2

Ran 3x C5.12xlarge instances for ~24 hours and got 500k done; will run 25x C5.6xlarge instances to speed things up.

Need to inventory completed runs, create a new to-do list, and split it across the 25 instances.

In [None]:
# Set-up 25x todo lists
aws s3 ls s3://serratus-bio/summary/ > complete.runs

sed 's/...............................//g'  complete.runs \
  | sed 's/.summary//g' - \
  > complete.sra
  
mkdir todo
comm -13 <(sort complete.sra) <(sort sra.list) > todo/todo.list

shuf todo/todo.list > todo.shuf

yes ./run_resummarize.sh -n 4 -s \
  | head -n 3270931 \
  | paste - todo.shuf \
  > todo.cmd

# l/25 splits on line boundaries so no command is cut in half
split -n l/25 -d todo.cmd todo.

aws s3 sync ./ s3://serratus-bio/todo/
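
The to-do list is a set difference (all accessions minus completed ones) followed by a line-wise split. A minimal sketch on toy data, assuming GNU coreutils; the file names `all.sra` and `chunk.` and the accessions are hypothetical.

```shell
#!/bin/bash
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy inputs (accessions are illustrative only)
printf '%s\n' SRR3 SRR1 SRR5 SRR2 SRR4 SRR6 > all.sra
printf '%s\n' SRR2 SRR5 > complete.sra

# comm needs sorted input
sort complete.sra > complete.sorted
sort all.sra > all.sorted

# -13 suppresses columns 1 and 3, keeping lines found only in the second file
comm -13 complete.sorted all.sorted > todo.list

# l/2 splits into 2 chunks on line boundaries; -d gives numeric suffixes
split -n l/2 -d todo.list chunk.

cat todo.list
ls chunk.*
```

Splitting with `-n 2` instead of `-n l/2` divides by bytes and can cut a line in half at a chunk boundary.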

In [None]:
screen

# Run-script
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

sudo docker run --rm --entrypoint /bin/bash \
-it serratusbio/serratus-merge:latest

# Within serratus-merge container
yum install -y bzip2 tar make
# Install GNU parallel
(wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
     fetch -o - http://pi.dk/3 ) > install.sh
bash install.sh; rm install.sh

aws s3 cp s3://lovelywater2/seq/cov3ma/cov3ma.sumzer.tsv ./
aws s3 cp s3://serratus-bio/todo/todo.07 ./

cat todo.* | parallel -j20

echo done

In [None]:
# Clean-up
screen

# Run-script
sudo yum install -y docker
sudo yum install -y git
sudo service docker start

sudo docker run --rm --entrypoint /bin/bash \
-it serratusbio/serratus-merge:latest

# Within serratus-merge container
yum install -y bzip2 tar make
# Install GNU parallel
(wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
     fetch -o - http://pi.dk/3 ) > install.sh
bash install.sh; rm install.sh

aws s3 cp s3://lovelywater2/seq/cov3ma/cov3ma.sumzer.tsv ./
aws s3 cp s3://lovelywater2/index.tsv ./
aws s3 ls s3://serratus-bio/summary/ > complete.list


# Make SRA list
sed 's/.* //g' complete.list \
  | sed 's/.summary//g' - \
  > complete.sra
  
sed 's/.* //g' index.tsv \
  | sed 's/.summary//g' - \
  > index.sra
  
comm -13 <(sort complete.sra) <(sort index.sra) > todo.list

yes ./run_resummarize.sh -n 4 -s \
  | head -n 2387 \
  | paste - todo.list \
  > todo.cmd
  
cat todo.cmd | parallel -j20
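
The accession list comes from stripping the date, time, and size columns and the `.summary` suffix from an `aws s3 ls` listing. A small sketch on a made-up listing (dates, sizes, and accessions are illustrative), with the dot escaped and the suffix anchored to the end of the line:

```shell
#!/bin/bash
set -eu
workdir=$(mktemp -d)

# Toy listing in the `aws s3 ls` layout: date, time, size, key
cat > "$workdir/complete.list" <<'EOF'
2020-08-19 10:15:02       5120 SRR0000001.summary
2020-08-19 10:15:07      20480 SRR0000002.summary
EOF

# Drop everything up to the last space, then the literal ".summary" suffix
sed -e 's/.* //' -e 's/\.summary$//' "$workdir/complete.list" \
  > "$workdir/complete.sra"

cat "$workdir/complete.sra"
```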

In [None]:
# on lovelywater2 clean-up
yes aws s3api put-object-acl --acl "public-read" --bucket lovelywater2 \
  --key summary/ | head -n 2387 > summary.1

## Update lovelywater2

In [None]:
# Log-in as lovelywater2 IAM
aws configure set default.s3.max_concurrent_requests 100
aws s3 sync --quiet --acl "public-read" \
  s3://serratus-bio/bam/ \
  s3://lovelywater2/bam/