# Regenerate GVCF Files
```
pi:ababaian
files: ~/Crown/.. (var0
start: 2019 06 19
complete : YYYY MM DD
```
## Introduction

I have been using the `ADcalc*.sh` script to create a common set of GVCF files for downstream analysis. The TCGA and CCLE data was proccessed uniformly.

I will re-process the CRC and hCAGE data for cell lines and primary cells to make GVCF files for analysis. This will make cross cohort analysis simplified.



In [1]:
cd ~/Crown/data2/hCAGE/
ls

2011_hCAGE_protocol.pdf       ADcalc_cage.sh  hCAGE_gvcf.log  screenlog.0
2014_MORAI_workflow_CAGE.pdf  cell_lines      primary         TssHmm_Info.pdf


## Objective

1. Generate `GVCF` files for hCAGE, CRC1 and CRC4 data-sets



## Materials and Methods -- hCAGE

Launch an AMI instance and run ADcalc script on data.


### Initialization

Initialize data on AMI instance, run ADcalc and output data to respective folder

In [None]:
# DNS:ec2-34-221-202-76.us-west-2.compute.amazonaws.com
# AMI: ami-0b375c9c58cb4a7a2 (TCGA aligner)
# Instance: m4.4xlarge
# Storage: 400 Gb

# ON REMOTE:

## Copy hCAGE files into it's dir
#mkdir -p ~/hCAGE; cd hCAGE;
#aws s3 cp --recursive s3://crownproject/hCAGE/ ./

## PATHS:
## ~/hCAGE/cell_lines/BAM/*
## ~/hCAGE/primary/BAM/*

# cd ~/hCAGE/cell_lines/BAM/

## Run ADcalc script.sh
# screen -L
# bash ADcalc_tcga.sh

# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle.gvcf.log

aws s3 cp s3://crownproject/ccle/logs/ccle.gvcf.log ./
cat ccle.gvcf.log

## DONE

### Scripts
`ADcalc_hCAGE.sh` @ s3://crownproject/hCAGE/ADcalc_hCAGE.sh


In [1]:
#!/bin/bash
# ADcalc_hCAGE.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/hCAGE/ADcalc_hcage.sh

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instances upon completion (for debugging)
TERMINATE='FALSE'

# S3 Output directory
S3DIR='s3://crownproject/hCAGE/gvcf/'
BAMLIST='bam.list.tmp'

# Script ------------------ ------------------------------
cd ~/hCAGE/cell_lines/
mkdir -p GVCF #Output Folder
TYPE='hCAGE_cell_lines' # hardcode single ccle run
cd BAM

#for TYPE in $(echo "hgr1")
#do
    echo Analyzing $TYPE...
    #cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam > ../GVCF/$TYPE.bamlist
          
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../GVCF
aws s3 cp --recursive ./ $S3DIR

# -------------------------------------------
cd ~/hCAGE/primary/
mkdir -p GVCF #Output Folder
TYPE='hCAGE_cell_lines' # hardcode single ccle run
cd BAM

#for TYPE in $(echo "hgr1")
#do
    echo Analyzing $TYPE...
    #cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam > ../GVCF/$TYPE.bamlist
          
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../GVCF
aws s3 cp --recursive ./ $S3DIR


# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi


Completed 5.7 KiB/8.7 MiB with 23 file(s) remainingupload: ./00_TCGA_Run_template.ipynb to s3://crownproject/hCAGE/gvcf/00_TCGA_Run_template.ipynb
Completed 5.7 KiB/8.7 MiB with 22 file(s) remainingCompleted 11.4 KiB/8.7 MiB with 22 file(s) remainingupload: .ipynb_checkpoints/00_TCGA_Run_template-checkpoint.ipynb to s3://crownproject/hCAGE/gvcf/.ipynb_checkpoints/00_TCGA_Run_template-checkpoint.ipynb
Completed 11.4 KiB/8.7 MiB with 21 file(s) remainingCompleted 17.1 KiB/8.7 MiB with 21 file(s) remainingupload: .ipynb_checkpoints/20190618_RE_GVCF-checkpoint.ipynb to s3://crownproject/hCAGE/gvcf/.ipynb_checkpoints/20190618_RE_GVCF-checkpoint.ipynb
Completed 17.1 KiB/8.7 MiB with 20 file(s) remainingCompleted 57.5 KiB/8.7 MiB with 20 file(s) remainingupload: .ipynb_checkpoints/20190506_TCGA_Run2_Pilot-checkpoint.ipynb to s3://crownproject/hCAGE/gvcf/.ipynb_checkpoints/20190506_TCGA_Run2_Pilot-checkpoint.ipynb
Completed 57.5 KiB/8.7 MiB with 19 file(s) remainingCompleted 61.5 K

In [3]:
# aws s3 cp screenlog.0 s3://crownproject/hCAGE/hCAGE_gvcf.log

aws s3 cp s3://crownproject/hCAGE/hCAGE_gvcf.log ./

cat hCAGE_gvcf.log

## shit, the primary cell data was misnamed (line 84) as cell line and overwrote
## the original run. Will have to re-run for cell lines and rename primary hCAGE data
# on AWS now

Completed 17.3 KiB/17.3 KiB with 1 file(s) remainingdownload: s3://crownproject/hCAGE/hCAGE_gvcf.log to ./hCAGE_gvcf.log
[01;32mubuntu@ip-172-31-17-124[00m:[01;34m~/hCAGE[00m$ ls
[0m[01;32mADcalc_hCAGE.sh[0m  [01;34mcell_lines[0m  [01;34mgvcf[0m  [01;34mprimary[0m  screenlog.0
[01;32mubuntu@ip-172-31-17-124[00m:[01;34m~/hCAGE[00m$ bash ADcalc_hCAGE.sh 
Analyzing hCAGE_cell_lines...
Started processing 18S
[W::hts_idx_load2] The index file is older than the data file: acute_lymphoblastic_leukemia__T-ALL__cell_line_aHPB-ALL.bam.bai
[W::hts_idx_load2] The index file is older than the data file: acute_myeloid_leukemia__FAB_M4__cell_line_aFKH-1.bam.bai
[W::hts_idx_load2] The index file is older than the data file: acute_myeloid_leukemia__FAB_M7__cell_line_aMKPL-1.bam.bai
[W::hts_idx_load2] The index file is older than the data file: B_lymphoblastoid_cell_line_a_GM12878_ENCODE_c_biol_rep3.bam.bai
[W::hts_idx_load2] The index file is older than the data fil

In [4]:
mkdir -p gvcf
aws s3 cp --recursive s3://crownproject/hCAGE/gvcf/ ./gvcf/

Completed 89.7 KiB/89.8 MiB with 12 file(s) remainingdownload: s3://crownproject/hCAGE/gvcf/hCAGE_cell_lines.5S.gvcf to gvcf/hCAGE_cell_lines.5S.gvcf
Completed 89.7 KiB/89.8 MiB with 11 file(s) remainingCompleted 345.7 KiB/89.8 MiB with 11 file(s) remainingCompleted 601.7 KiB/89.8 MiB with 11 file(s) remainingCompleted 613.8 KiB/89.8 MiB with 11 file(s) remainingdownload: s3://crownproject/hCAGE/gvcf/hCAGE_cell_lines.bamlist to gvcf/hCAGE_cell_lines.bamlist
Completed 613.8 KiB/89.8 MiB with 10 file(s) remainingCompleted 869.8 KiB/89.8 MiB with 10 file(s) remainingCompleted 1.1 MiB/89.8 MiB with 10 file(s) remaining  Completed 1.3 MiB/89.8 MiB with 10 file(s) remaining  Completed 1.6 MiB/89.8 MiB with 10 file(s) remaining  Completed 1.8 MiB/89.8 MiB with 10 file(s) remaining  Completed 2.1 MiB/89.8 MiB with 10 file(s) remaining  Completed 2.3 MiB/89.8 MiB with 10 file(s) remaining  Completed 2.6 MiB/89.8 MiB with 10 file(s) remaining  Completed 2.8 MiB/89.8 MiB with 10 f

## Materials and Methods -- CRC

Launch an AMI instance and run ADcalc script on data.

In [None]:
# DNS:XXX
# AMI: ami-0b375c9c58cb4a7a2 (TCGA aligner)
# Instance: m4.4xlarge
# Storage: 400 Gb

# ON REMOTE:

## Copy CRC files into it's dir
#mkdir -p ~/crc; cd crc;
#aws s3 cp --recursive s3://crownproject/crc/ ./

## PATHS:
## ~/hCAGE/cell_lines/BAM/*
## ~/hCAGE/primary/BAM/*

# cd ~/hCAGE/cell_lines/BAM/

## Run ADcalc script.sh
# screen -L
# bash ADcalc_tcga.sh

# aws s3 cp screenlog.0 s3://crownproject/ccle/logs/ccle.gvcf.log

aws s3 cp s3://crownproject/ccle/logs/ccle.gvcf.log ./
cat ccle.gvcf.log

## DONE

In [None]:
#!/bin/bash
# ADcalc_crc.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/hCAGE/ADcalc_crc.sh

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instances upon completion (for debugging)
TERMINATE='FALSE'

# S3 Output directory
S3DIR='s3://crownproject/crc/gvcf/'
BAMLIST='bam.list.tmp'

# Script ------------------ ------------------------------
cd ~/crc/
mkdir -p GVCF #Output Folder
TYPE='crc1' # hardcode single ccle run
cd crc1/hgr1/

#for TYPE in $(echo "hgr1")
#do
    echo Analyzing $TYPE...
    #cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam > ../../GVCF/$TYPE.bamlist
          
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../../GVCF
aws s3 cp --recursive ./ $S3DIR

# -------------------------------------------
cd ~/crc/
mkdir -p GVCF #Output Folder
TYPE='crc4' # hardcode single ccle run
cd crc4/hgr1/

#for TYPE in $(echo "hgr1")
#do
    echo Analyzing $TYPE...
    #cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam > ../../GVCF/$TYPE.bamlist
          
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

#    cd .. # move to tcga folder to reset
#done

# Copy GVCF output to AWS S3
cd ../../GVCF
aws s3 cp --recursive ./ $S3DIR


# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi


## Discussion
