# 1000 Genomes Alignment to hgr1 -- version 2
```
pi:ababaian
files: ~/Crown/data/1kg2_hgr1/
start: 2019 10 22
complete : XXXX XX XX
```
## Introduction

I'll be re-visiting 1000 genomes project data. I've been focusing on RNA-seq/cancer but I need a good cohort of normal patients. The objective is to re-analyze the complete set of 1000 genomes data under a unified pipeline.


## Objective

**Previous**
- [x] Align the PCR-Free, deep WGS data from CEPH-1436 trio to `hgr1`
- [x] Align 100 low-coverage genomes from all 1000 genomes populations to `hgr1`

**Current**
- [ ] Align all 1000 genomes data to `hgr1`


## Materials and Methods

~ Data-sets
~ Scripts

### Data-sets

Data-sets are cataloged in `~/Crown/data/1kg2_hgr1/1kg_hgr1_v2_datasets.xlsx`.

#### 1000 Genomes data

- From the 1kg [sequence.index](https://s3.amazonaws.com/1000genomes/sequence.index) (accessed 191022), samples were filtered for:
```
INSTRUMENT_PLATFORM: ILLUMINA
INSTRUMENT_MODEL: Illumina HiSeq 2000
LIBRARY_LAYOUT: PAIRED
WITHDRAWN: 0
READ_COUNT: >1,000,000
ANALYSIS_GROUP: low coverage
```
This yields `25949` files from `1917` independent sample_names (people) in the `filtered.seq` sheet. File order was randomized to sample evenly across patients; I'll process as many files as I can in the next few days.
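The filter above could be reproduced on the command line roughly as follows. This is a sketch only: the actual filtering was done in the spreadsheet, the function name `filter_seq_index` is hypothetical, and it assumes a local tab-separated copy of `sequence.index` whose first line is a header row carrying the field names listed above. Columns are looked up by name so no column positions are assumed.

```sh
# Sketch: filter a local copy of sequence.index with the criteria listed above.
# Assumption: tab-separated file, first line is a header naming the columns.
filter_seq_index() {
  awk -F'\t' '
    NR == 1 {
      sub(/^#/, "", $1)                        # tolerate a leading "#" on the header
      for (i = 1; i <= NF; i++) col[$i] = i    # map column name -> column position
      next
    }
    $(col["INSTRUMENT_PLATFORM"]) == "ILLUMINA" &&
    $(col["INSTRUMENT_MODEL"]) == "Illumina HiSeq 2000" &&
    $(col["LIBRARY_LAYOUT"]) == "PAIRED" &&
    $(col["WITHDRAWN"]) == 0 &&
    ($(col["READ_COUNT"]) + 0) > 1000000 &&
    $(col["ANALYSIS_GROUP"]) == "low coverage"
  ' "$1"
}
```

Something like `filter_seq_index sequence.index | wc -l` should then give the number of files passing the filter, to compare against the count in the sheet.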


In [3]:
# Initialize
WORKDIR='/home/artem/Desktop/Crown/data/1kg2_hgr1'
cd $WORKDIR

# Amazon AWS S3 Home URL
S3URL='s3://crownproject2/1kg'




### Scripts

- `hgr1_align_v4.1kg.sh` - Core alignment script for hgr1 alignment and VCF
- `queenB.sh` - head node control script
- `droneB.sh` - worker node control script


In [16]:
cat hgr1_align_v4.1kg.sh
echo '---------------------'
cat droneB.sh
echo '---------------------'
cat queenB.sh

#!/bin/bash
# hgr1_align_v4.1kg.sh
# rDNA alignment pipeline - S3 version
PIPE_VERSION='191022 build -- 1000 genomes'
AMI_VERSION='crown-190601 - ami-0b375c9c58cb4a7a2'
# EC2: c4.2xlarge (8cpu / 15 gb)
# EC2: c4.xlarge  (4cpu / 8  gb)
# Storage: 200 Gb
#

# Input Requirements --------------------------

# $1 : Library name + Output name(unique)
# $2 : Seq-read type (wgs|rna)
# $3 : Patient population
# $4 : Sample ID
# $5 : Read Group ID (SRA Accession)
# $6 : Library name (Accession)

# Control Panel -------------------------------
# Amazon AWS S3 Home URL
  S3URL='s3://crownproject2/1kg'

# CPU
  THREADS='3'

# Terminate instances upon completion (for debugging)
  TERMINATE='TRUE'

# Read Group Data
  LIBRARY=$1    # Library Name / File prefix / patient ID
  TYPE=$2       # wgs OR rna data-type (using crc5 here)
  RGPO=$3       # Patient Population - CPTAC
  RGSM=$4       # Sample ID
  RGID=$5       # Read Group ID. SRA Accession Number
  RGLB=$6       # Library Name. Accession Number
  RGPL='ILLUMINA'   # Seq Platform
  

### Pilot Alignments

Run 2 samples to pilot the pipe


In [11]:
INPUT='1kg_pilot.input'

# Note the different column requirements from CCLE
cat $INPUT

SRR596531	1kg	ESN	HG03133	SRA059953	SRS344086	SRP015238	data/HG03133/sequence_read/SRR596531_1.filt.fastq.gz	data/HG03133/sequence_read/SRR596531_2.filt.fastq.gz
SRR582610	1kg	GWD	HG02461	SRA059330	SRS290867	SRP001518	data/HG02461/sequence_read/SRR582610_1.filt.fastq.gz	data/HG02461/sequence_read/SRR582610_2.filt.fastq.gz
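Since the input rows feed the worker scripts directly, a quick sanity check before uploading may be worthwhile. The sketch below is not part of the pipeline: `check_input` and `list_runs` are hypothetical helpers, and the column meanings are my reading of the pilot rows above rather than anything taken from the scripts.

```sh
# Assumed column meanings, inferred from the pilot rows (not from the scripts):
# 1 run accession, 2 data type, 3 population, 4 sample name, 5 submission,
# 6 SRA sample, 7 study, 8 FASTQ mate 1, 9 FASTQ mate 2

# Report every row that does not have exactly 9 tab-separated columns;
# the exit status is non-zero if any such row was found.
check_input() {
  awk -F'\t' '
    NF != 9 { printf "line %d: %d columns\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Print run, population, sample and both FASTQ paths for each row.
# Note: tab is IFS whitespace, so empty fields would shift the columns.
list_runs() (
  tab="$(printf '\t')"
  while IFS="$tab" read -r run type pop sample submission srs study fq1 fq2; do
    printf '%s %s %s %s %s\n' "$run" "$pop" "$sample" "$fq1" "$fq2"
  done < "$1"
)
```

Running `check_input 1kg_pilot.input && list_runs 1kg_pilot.input` would flag a truncated row before any instances are launched.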


In [15]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/ --acl bucket-owner-full-control
aws s3 cp droneB.sh $S3URL/scripts/ --acl bucket-owner-full-control
aws s3 cp hgr1_align_v4.1kg.sh $S3URL/scripts/ --acl bucket-owner-full-control
aws s3 cp $INPUT $S3URL/scripts/ --acl bucket-owner-full-control


upload: ./queenB.sh to s3://crownproject2/1kg/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject2/1kg/scripts/droneB.sh
upload: ./hgr1_align_v4.1kg.sh to s3://crownproject2/1kg/scripts/hgr1_align_v4.1kg.sh
upload: ./1kg_batch1.input to s3://crownproject2/1kg/scripts/1kg_batch1.input


In [7]:
# start
date
date -u

Tue Oct 22 21:43:05 PDT 2019
Wed Oct 23 04:43:05 UTC 2019


In [14]:
INPUT='1kg_batch1.input'
aws s3 cp $INPUT $S3URL/scripts/ --acl bucket-owner-full-control
echo ''

# Note the different column requirements from CCLE
cat $INPUT

upload: ./1kg_batch1.input to s3://crownproject2/1kg/scripts/1kg_batch1.input

SRR792212	1kg	CEU	NA11920	SRA070622	SRS000050	SRP000547	data/NA11920/sequence_read/SRR792212_1.filt.fastq.gz	data/NA11920/sequence_read/SRR792212_2.filt.fastq.gz
SRR796796	1kg	PUR	HG01095	SRA071537	SRS010767	SRP001525	data/HG01095/sequence_read/SRR796796_1.filt.fastq.gz	data/HG01095/sequence_read/SRR796796_2.filt.fastq.gz
ERR240351	1kg	JPT	NA18965	ERA201497	SRS000167	SRP000544	data/NA18965/sequence_read/ERR240351_1.filt.fastq.gz	data/NA18965/sequence_read/ERR240351_2.filt.fastq.gz
SRR588347	1kg	GWD	HG02594	SRA059330	SRS290891	SRP001518	data/HG02594/sequence_read/SRR588347_1.filt.fastq.gz	data/HG02594/sequence_read/SRR588347_2.filt.fastq.gz
ERR184487	1kg	STU	HG03697	ERA169315	SRS352873	SRP015242	data/HG03697/sequence_read/ERR184487_1.filt.fastq.gz	data/HG03697/sequence_read/ERR184487_2.filt.fastq.gz
SRR584024	1kg	GWD	HG02628	SRA059330	SRS290903	SRP

In [None]:
# Remote EC2 Instance Operations ----------------------

# Remote:
# Manually open an Amazon Linux 2 AMI
# ami-061392db613a6357b
# t2.micro
#
# ssh login:
# ssh -i "<key>.pem" ec2-user@PUBLICDNS
#

# Commands on EC2 machine to set-up AWS
# enter personal login info:

# REMOTE:
#aws configure
  # AWS Key ID
  # AWS Secret Key ID
  # Region: us-west-2
  
# Copy local run files to S3 and download them on EC2

# REMOTE:
# aws s3 cp --recursive s3://crownproject2/1kg/scripts/ ./
#
# mv <KEY>.pem ~/.ssh/
# chmod 400 ~/.ssh/<KEY>.pem

# REMOTE:
# Open logging screen and begin launching EC2 instances
# screen -L
# 
# bash queenB.sh 1kg_pilot.input
#
# aws s3 cp screenlog.0 s3://crownproject2/1kg/logs/1kg_pilot.log

aws s3 cp s3://crownproject2/1kg/logs/1kg_pilot.log ./
cat 1kg_pilot.log
date -u

# Run completed successfully!

Pilot ran successfully. Ran ~876 files while AWS got its business together. I have ~36 hours to go, so ramp up.

AWS was approved, so I will prioritize files by total read count, in descending order. The files were broken down further into 5 batches (2-6) to run concurrently.
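One way the ranking and batching could be scripted is sketched below. It is an illustration under assumptions, not how the batch files were actually made: `make_batches` is a hypothetical helper, the input is assumed to be the usual tab-separated rows with the read count prepended as an extra first column, and rows are dealt round-robin so every batch starts with its largest files.

```sh
# Sketch: rank rows by read count (descending) and deal them into N batches.
# Assumed input: tab-separated rows with READ_COUNT prepended as column 1.
# Usage: make_batches RANKED_TSV OUTDIR [N_BATCHES] [FIRST_BATCH_NUMBER]
make_batches() (
  ranked="$1"; outdir="$2"; n="${3:-5}"; first="${4:-2}"
  tab="$(printf '\t')"
  mkdir -p "$outdir"
  sort -t "$tab" -k1,1nr "$ranked" |   # largest read counts first
    cut -f2- |                         # drop the helper read-count column
    awk -v n="$n" -v first="$first" -v dir="$outdir" '
      { b = (NR - 1) % n + first       # round-robin batch number
        print > (dir "/1kg_batch" b ".input") }'
)
```

With the defaults this writes `1kg_batch2.input` through `1kg_batch6.input` into the output directory, each already ordered from highest to lowest read count.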


In [18]:
ls -alh inputs/*


-rw-r--r-- 1 artem artem 159K Oct 23 00:09 inputs/1kg_batch1.input
-rw-rw-r-- 1 artem artem 794K Oct 24 23:49 inputs/1kg_batch2.input
-rw-rw-r-- 1 artem artem 794K Oct 24 23:46 inputs/1kg_batch3.input
-rw-rw-r-- 1 artem artem 794K Oct 24 23:47 inputs/1kg_batch4.input
-rw-rw-r-- 1 artem artem 793K Oct 24 23:47 inputs/1kg_batch5.input
-rw-rw-r-- 1 artem artem 793K Oct 24 23:49 inputs/1kg_batch6.input
-rw-rw-r-- 1 artem artem  324 Oct 22 21:20 inputs/1kg_pilot.input
-rw-r--r-- 1 artem artem 4.1M Oct 23 00:08 inputs/1kg_total.input


In [20]:
# Local Folder Operations -----------------------------
# LOCAL:
cd $WORKDIR

#NOTE For pilot run, AWS s3 shutdown commented out. Re-upload hgr1 script upon full run

aws s3 cp queenB.sh $S3URL/scripts/ --acl bucket-owner-full-control
aws s3 cp droneB.sh $S3URL/scripts/ --acl bucket-owner-full-control
aws s3 cp hgr1_align_v4.1kg.sh $S3URL/scripts/ --acl bucket-owner-full-control
aws s3 cp --recursive inputs/ $S3URL/scripts/ --acl bucket-owner-full-control


upload: ./queenB.sh to s3://crownproject2/1kg/scripts/queenB.sh
upload: ./droneB.sh to s3://crownproject2/1kg/scripts/droneB.sh
upload: ./hgr1_align_v4.1kg.sh to s3://crownproject2/1kg/scripts/hgr1_align_v4.1kg.sh

The user-provided path 1kg_batch1.input does not exist.
upload: inputs/1kg_pilot.input to s3://crownproject2/1kg/scripts/1kg_pilot.input
upload: inputs/1kg_batch1.input to s3://crownproject2/1kg/scr