# TCGA GVCF File Generation
```
pi:ababaian
files: ~/Crown/data2/tcga2_gvcf/
start: 2019 05 19
complete : 2019 05 23
```
## Introduction

With the completion of ~9500 TCGA2 alignments, nearly 10K patient RNA-seq files now have completed hgr1 alignments. This data checkpoint is called `190520`.

Generate an EC2 AMI containing the entire TCGA data set. Using this AMI, generate `GVCF` files for the main regions of interest for GOAT analysis.


## Objective


## Materials and Methods

Open EC2 Instance for Crown Data
    AMI: crown-tcga-181124 (ami-053dfb448b82492ac)
    Instance Type: r4.large
    Storage: 2500 Gb
    DNS: ec2-54-184-85-170.us-west-2.compute.amazonaws.com


In [None]:
# ON REMOTE:

# Copy TCGA2 files on top of TCGA directory on the Crown Data filesystem.
cd tcga/
aws s3 cp --recursive s3://crownproject/tcga2/ ./

## SAVE AMI AS:
#
# NAME: crown-tcga-190520
# AMI: ami-0235dec7fdd4dc830
# DESC: TCGA DATA
# INSTANCE TYPE: M4.xlarge


In [1]:
# Launch instance of crown-tcga-190520

## REMOTE:

# cd ~/tcga/

## Copy over file list to s3
# ls -alr ./* > tcga_total.filelist
# aws s3 cp tcga_total.filelist s3://crownproject/tcga2/gvcf_190520/

## Run a pilot (ACC) ADcalc_tcga.sh script (below)
# cd ~
# aws s3 cp s3://crownproject/tcga2/scripts/ADcalc_tcga.sh ./
## VIM edit to fit group
## Group A: ec2-54-218-168-107.us-west-2.compute.amazonaws.com DONE
## Group B: ec2-54-187-164-194.us-west-2.compute.amazonaws.com DONE
## Group C: ec2-34-217-174-2.us-west-2.compute.amazonaws.com +! (Killed)
## Group C2: ec2-34-218-254-165.us-west-2.compute.amazonaws.com DONE
## Group D: ec2-54-202-83-77.us-west-2.compute.amazonaws.com DONE
## Group E: ec2-54-213-224-60.us-west-2.compute.amazonaws.com DONE

## Group C3: ec2-52-38-21-94.us-west-2.compute.amazonaws.com + LUSC MESO OV
## Group C4: ec2-54-244-212-118.us-west-2.compute.amazonaws.com DONE
# screen -L
# bash ADcalc_tcga.sh




#### Initial Pass

| TCGA Project Code | Cancer Type                                                      | Complete | 
|-------------------|------------------------------------------------------------------|----------| 
| LAML              | Acute Myeloid Leukemia                                           | Y        | 
| ACC               | Adrenocortical carcinoma                                         | Y        | 
| BLCA              | Bladder Urothelial Carcinoma                                     | Y        | 
| LGG               | Brain Lower Grade Glioma                                         | Y        | 
| BRCA              | Breast invasive carcinoma                                        | Y        | 
| CESC              | Cervical squamous cell carcinoma and endocervical adenocarcinoma | Y        | 
| CHOL              | Cholangiocarcinoma                                               | Y        | 
| COAD              | Colon adenocarcinoma                                             | Y        | 
| ESCA              | Esophageal carcinoma                                             | Y        | 
| GBM               | Glioblastoma multiforme                                          | Y        | 
| HNSC              | Head and Neck squamous cell carcinoma                            | Y        | 
| KICH              | Kidney Chromophobe                                               | Y        | 
| KIRC              | Kidney renal clear cell carcinoma                                | Y        | 
| KIRP              | Kidney renal papillary cell carcinoma                            | Y        | 
| LIHC              | Liver hepatocellular carcinoma                                   | Y        | 
| LUAD              | Lung adenocarcinoma                                              | Y        | 
| LUSC              | Lung squamous cell carcinoma                                     | Y        | 
| DLBC              | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma                  | Y        | 
| MESO              | Mesothelioma                                                     | Y        | 
| OV                | Ovarian serous cystadenocarcinoma                                | Y        | 
| PAAD              | Pancreatic adenocarcinoma                                        | Y        | 
| PCPG              | Pheochromocytoma and Paraganglioma                               | Y        | 
| PRAD              | Prostate adenocarcinoma                                          | Y        | 
| READ              | Rectum adenocarcinoma                                            | Y        | 
| SARC              | Sarcoma                                                          | Y        | 
| SKCM              | Skin Cutaneous Melanoma                                          | Y        | 
| STAD              | Stomach adenocarcinoma                                           | Y        | 
| TGCT              | Testicular Germ Cell Tumors                                      | Y        | 
| THYM              | Thymoma                                                          | Y        | 
| THCA              | Thyroid carcinoma                                                | Y        | 
| UCS               | Uterine Carcinosarcoma                                           | Y        | 
| UCEC              | Uterine Corpus Endometrial Carcinoma                             | Y        | 
| UVM               | Uveal Melanoma                                                   | Y        | 


In [None]:
#!/bin/bash
# ADcalc_tcga.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/tcga2/scripts/ADcalc_tcga.sh

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instance upon completion (set to 'FALSE' when debugging)
TERMINATE='FALSE'

# S3 Output directory
S3DIR='s3://crownproject/tcga2/gvcf_190520/'

# Script ------------------
BAMLIST='bam.list.tmp'

cd ~/tcga/
mkdir -p GVCF #Output Folder

# Iterate through TCGA Cancer Cohorts
# Pilot Test  -------------
for TYPE in $(echo "TCGA-XXX")
# Group A -----------------
#for TYPE in {'TCGA-ACC','TCGA-BLCA','TCGA-BRCA','TCGA-CESC',\
#'TCGA-CHOL','TCGA-COAD','TCGA-DLBC','TCGA-ESCA'}
# Group B -----------------
#for TYPE in {'TCGA-GBM','TCGA-HNSC','TCGA-KICH','TCGA-KIRC',\
#'TCGA-KIRP','TCGA-LAML','TCGA-LGG','TCGA-LIHC'}
# Group C -----------------
#for TYPE in {'TCGA-LUAD','TCGA-LUSC','TCGA-MESO','TCGA-OV',\
#'TCGA-PAAD','TCGA-PCPG','TCGA-PRAD'}
# Group D -----------------
#for TYPE in {'TCGA-READ','TCGA-SARC','TCGA-SKCM','TCGA-STAD',\
#'TCGA-TGCT','TCGA-THCA','TCGA-THYM','TCGA-UCEC','TCGA-UCS','TCGA-UVM'}
do
    echo Analyzing $TYPE...
    cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam >> ../GVCF/$TYPE.bamlist
    #ls *.bam >> ../GVCF/tcga.bamlist
           
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

    cd .. # move to tcga folder to reset
done

# Copy GVCF output to AWS S3
cd GVCF
aws s3 cp --recursive ./ $S3DIR

# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi
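
The group loops above were switched by commenting lines in and out with VIM on each node. A minimal sketch of an alternative, assuming the same four cohort groups: look the cohort list up by group letter. `cohorts_for_group` is a hypothetical helper and is not part of `ADcalc_tcga.sh`.

```shell
#!/bin/bash
# Sketch: pick the cohort group by name instead of editing the loop by hand.
# cohorts_for_group is a hypothetical helper; the lists mirror Groups A-D above.
cohorts_for_group() {
  case "$1" in
    A) echo "TCGA-ACC TCGA-BLCA TCGA-BRCA TCGA-CESC TCGA-CHOL TCGA-COAD TCGA-DLBC TCGA-ESCA" ;;
    B) echo "TCGA-GBM TCGA-HNSC TCGA-KICH TCGA-KIRC TCGA-KIRP TCGA-LAML TCGA-LGG TCGA-LIHC" ;;
    C) echo "TCGA-LUAD TCGA-LUSC TCGA-MESO TCGA-OV TCGA-PAAD TCGA-PCPG TCGA-PRAD" ;;
    D) echo "TCGA-READ TCGA-SARC TCGA-SKCM TCGA-STAD TCGA-TGCT TCGA-THCA TCGA-THYM TCGA-UCEC TCGA-UCS TCGA-UVM" ;;
    *) echo "unknown group: $1" >&2; return 1 ;;
  esac
}

# Example: print the cohorts that a Group C node would iterate over
for TYPE in $(cohorts_for_group C); do
  echo "Would analyze $TYPE"
done
```

With this in place the main loop could read `for TYPE in $(cohorts_for_group "$GROUP")`, where `GROUP` is an assumed variable set per node.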


#### Errors arising

While processing BRCA, there was an error and the output files were empty. Will remove this library and re-run it on a separate node. Delete the empty GVCF files from the Group A node.

Error Message: `[mpileup] fail to load index for TCGA-E9-A1RC-11A.hgr1.bam`

Ran Group E below.

Error Message 2: `[mpileup] fail to load index for TCGA-E9-A1RD-01A.hgr1.bam`
Error Message 3: `[mpileup] fail to load index for TCGA-E9-A1RD-11A.hgr1.bam`
...

The error may come from passing too many BAM files to a single `mpileup` call: it recurs at about 1020 BAM files into the list. I will modify the script to split the BAM list in two and produce separate BRCA1 and BRCA2 GVCF files.
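
One possible explanation, not confirmed in this run, is the per-process open-file limit, which is commonly 1024 by default on Linux and would be reached at roughly this file count if every BAM is held open at once. A minimal sketch of a check, where `fd_headroom` is a hypothetical helper and the limit defaults to `ulimit -n`:

```shell
#!/bin/bash
# Sketch: compare the number of BAM files in a list with the open-file limit.
# fd_headroom is a hypothetical helper; pass a limit as $2 to override `ulimit -n`.
fd_headroom() {
  local list="$1"
  local limit="${2:-$(ulimit -n)}"
  local n
  n=$(grep -c . "$list" || true)   # non-empty lines, one BAM per line
  if [ "$limit" = "unlimited" ]; then
    echo "OK $n files, limit unlimited"
    return 0
  fi
  if [ "$n" -ge "$limit" ]; then
    echo "WARN $n files >= limit $limit"
    return 1
  fi
  echo "OK $n files < limit $limit"
}
```

Running `fd_headroom bam.list.tmp` inside a cohort directory before `mpileup` would show whether the list is close to the limit. The process also needs descriptors for the reference and standard streams, so a list just under the limit can still fail.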

The following lines were inserted after line 52 of the script:
```
    # lazy 600 file limiter
    ls *.bam | head -n600 - > bam.list.tmp
    TYPE='TCGA-BRCA1'
    
    #ls *.bam | tail -n +601 - > bam.list.tmp
    #TYPE='TCGA-BRCA2'
```
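
The head/tail limiter above handles exactly two halves. A sketch of a more general version, assuming the same 600-file chunk size, uses `split` to cut the list into as many chunks as needed. `split_bamlist` and the `bam.chunk.` prefix are illustrative names.

```shell
#!/bin/bash
# Sketch: split a cohort's BAM list into fixed-size chunks.
# split_bamlist, the 600-file default and the output prefix are illustrative.
split_bamlist() {
  local list="$1" chunk="${2:-600}" prefix="${3:-bam.chunk.}"
  split -l "$chunk" "$list" "$prefix"   # writes ${prefix}aa, ${prefix}ab, ...
  ls "$prefix"*                         # print the chunk files that were written
}
```

Each chunk file could then be passed to `bcftools mpileup -b` with its own output name, in the same way BRCA1 and BRCA2 were kept separate.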


In [None]:
## Remote (Group E): ec2-54-213-224-60.us-west-2.compute.amazonaws.com
# mkdir tmp
# aws s3 cp s3://crownproject/tcga2/scripts/ADcalc_tcga.sh ./

## VIM Edit file to run BRCA analysis w/out shutdown
# screen -L
# bash ADcalc_tcga.sh

In [None]:
## Remote (Group C2):

## Cannot connect to Group C node, will re-try LUAD analysis to
## see if this is a reproducible error or a one-off error.

## Finished successfully.

#### Quality Control 1 -- Line Counts

In [3]:
## Quality Control 1

## REMOTE:
# mkdir ~/tmp; cd tmp;
# aws s3 cp --recursive s3://crownproject/tcga2/gvcf_190520/ ./

## Check line count for each file (many are incomplete)
# wc -l $(ls) > gvcf.linecounts_1
# aws s3 cp gvcf.linecounts_1 s3://crownproject/tcga2/gvcf_190520/

mkdir -p ~/Crown/data2/tcga2_gvcf/
cd  ~/Crown/data2/tcga2_gvcf/
aws s3 cp s3://crownproject/tcga2/gvcf_190520/gvcf.linecounts_1 ./

cat gvcf.linecounts_1

download: s3://crownproject/tcga2/gvcf_190520/gvcf.linecounts_1 to ./gvcf.linecounts_1
       118 TCGA-ACC.18SE.gvcf
      1884 TCGA-ACC.18S.gvcf
      5196 TCGA-ACC.28S.gvcf
       172 TCGA-ACC.5.8S.gvcf
       140 TCGA-ACC.5S.gvcf
        79 TCGA-ACC.bamlist
       123 TCGA-BLCA.18SE.gvcf
      1884 TCGA-BLCA.18S.gvcf
      5198 TCGA-BLCA.28S.gvcf
       172 TCGA-BLCA.5.8S.gvcf
       137 TCGA-BLCA.5S.gvcf
       862 TCGA-BLCA.bamlist
       123 TCGA-BRCA1.18SE.gvcf
      1884 TCGA-BRCA1.18S.gvcf
      5212 TCGA-BRCA1.28S.gvcf
       172 TCGA-BRCA1.5.8S.gvcf
       136 TCGA-BRCA1.5S.gvcf
       124 TCGA-BRCA2.18SE.gvcf
      1885 TCGA-BRCA2.18S.gvcf
      5247 TCGA-BRCA2.28S.gvcf
       172 TCGA-BRCA2.5.8S.gvcf
       138 TCGA-BRCA2.5S.gvcf
      2408 TCGA-BRCA.bamlist
       121 TCGA-CESC.18SE.gvcf
      1884 TCGA-CESC.18S.gvcf
      5198 TCGA-CESC.28S.gvcf
       172 TCGA-CESC.5.8S.gvcf
       137 TCGA-C

The following project-files are incomplete:

```
VCF_LINES	PROJECT	FILE	FAIL
14	TCGA-OV	28S	X
28	TCGA-STAD	18S	X
30	TCGA-OV	18S	X
35	TCGA-STAD	28S	X
1704	TCGA-LUAD	28S	X
1709	TCGA-LUSC	28S	X
```
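
The incomplete files can also be flagged automatically. The sketch below assumes a complete file holds at least one record per position in its region (1870 for 18S, 5071 for 28S, and so on), so a file with fewer data lines than the region length is marked `X`. Indel records or zero-coverage positions can shift the count, so treat the threshold as approximate. `region_length` and `check_gvcf` are hypothetical helpers.

```shell
#!/bin/bash
# Sketch: flag gVCF files with fewer data lines than positions in the region.
# region_length and check_gvcf are hypothetical helpers.

# Length of a 1-based inclusive region string such as chr13:1003660-1005529
region_length() {
  local range="${1#*:}"                 # strip the "chr13:" prefix
  local start="${range%-*}" end="${range#*-}"
  echo $(( end - start + 1 ))
}

# Print "<data_lines> <expected> <file> <OK|X>"
check_gvcf() {
  local file="$1" expected="$2" n
  n=$(grep -vc '^#' "$file" || true)    # records only, header lines excluded
  if [ "$n" -lt "$expected" ]; then
    echo "$n $expected $file X"
  else
    echo "$n $expected $file OK"
  fi
}
```

For example, `check_gvcf TCGA-OV.28S.gvcf "$(region_length 'chr13:1007948-1013018')"` compares the OV 28S file against the 5071 positions of the 28S region.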

This is almost certainly a memory issue, as these are particularly large files. There are two solutions.

1) Re-run LUAD/LUSC 28S on an instance with more memory (32 Gb)
2) For OV/STAD, re-run on a higher-memory instance and also split the BAM lists into two sections, as was done for BRCA.



**LUAD/LUSC 28S Script** on m4.4xlarge (64 Gb memory)

```
Analyzing TCGA-LUSC...
Started processing 28S
[W::hts_idx_load2] The index file is older than the data file: TCGA-22-4593-01A.hgr1.bam.bai
[W::hts_idx_load2] The index file is older than the data file: TCGA-22-5478-11A.hgr1.bam.bai
[W::hts_idx_load2] The index file is older than the data file: TCGA-43-6647-01A.hgr1.bam.bai
[W::hts_idx_load2] The index file is older than the data file: TCGA-43-6773-11A.hgr1.bam.bai
[mpileup] 551 samples in 551 input files
Warning: Potential memory hog, up to 55100000M reads in the pileup!
...
```
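
The `hts_idx_load2` lines in this log are warnings only, but the earlier BRCA failure was an index that could not be loaded at all. A minimal sketch of a pre-flight check, assuming indices sit next to each BAM as `<file>.bam.bai` (as in the log above): list every BAM whose index is missing or older than the BAM. `check_bam_indices` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: list BAM files whose .bai index is missing or older than the BAM.
# check_bam_indices is a hypothetical helper; it reads one BAM path per line.
check_bam_indices() {
  local list="$1" bam status=0
  while IFS= read -r bam; do
    [ -z "$bam" ] && continue            # skip blank lines
    if [ ! -e "$bam.bai" ]; then
      echo "MISSING $bam.bai"; status=1
    elif [ "$bam" -nt "$bam.bai" ]; then
      echo "STALE $bam.bai"; status=1    # BAM modified after its index
    fi
  done < "$list"
  return "$status"
}
```

Any file reported here could be re-indexed with `samtools index` before the run. A stale timestamp may only reflect the order in which files were copied, so it does not by itself mean the index is wrong.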

In [None]:
#!/bin/bash
# ADcalc_tcga.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/tcga2/scripts/ADcalc_tcga.sh

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS='chr13:1007948-1013018' # 28 S only

# Corresponding region/gene names
GENES='28S'

# Terminate instance upon completion (set to 'FALSE' when debugging)
TERMINATE='TRUE'

# S3 Output directory
S3DIR='s3://crownproject/tcga2/gvcf_190520/'

# Script ------------------
BAMLIST='bam.list.tmp'

cd ~/tcga/
mkdir -p GVCF #Output Folder

# Iterate through LUSC and LUAD (28S re-run only)
for TYPE in {'TCGA-LUSC','TCGA-LUAD'}
do
    echo Analyzing $TYPE...
    cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam >> ../GVCF/$TYPE.bamlist
    #ls *.bam >> ../GVCF/tcga.bamlist
           
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

    cd .. # move to tcga folder to reset
done

# Copy GVCF output to AWS S3
cd GVCF
aws s3 cp --recursive ./ $S3DIR

# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi


#### OV/STAD Re-run

Use `m4.4xlarge` instance type (64 Gb)

In [None]:
#!/bin/bash
# ADcalc_tcga.sh
# Allelic Depth Calculator
# for a position
#
# s3://crownproject/tcga2/scripts/ADcalc_tcga.sh

# Controls -----------------
DEPTH='100000' #Max per file DP

# Regions in hgr1.fa reference genome
REGIONS=('chr13:1003660-1005529' 'chr13:1005529-1005629' \
        'chr13:10219-10340' 'chr13:1006622-1006779' 'chr13:1007948-1013018')

# Corresponding region/gene names
GENES=('18S' '18SE' '5S' '5.8S' '28S')

# 18S  1870
# 18SE 101
# 5S   122
# 5.8S 158
# 28S  5071

# Terminate instance upon completion (set to 'FALSE' when debugging)
TERMINATE='TRUE'

# S3 Output directory
S3DIR='s3://crownproject/tcga2/gvcf_190520/'

# Script ------------------
BAMLIST='bam.list.tmp'

cd ~/tcga/
mkdir -p GVCF #Output Folder

for TYPE in {'TCGA-OV','TCGA-STAD'}
do
    echo Analyzing $TYPE...
    cd $TYPE

    ls *.bam > bam.list.tmp
    ls *.bam >> ../GVCF/$TYPE.bamlist
    #ls *.bam >> ../GVCF/tcga.bamlist
           
    for index in ${!GENES[*]}
    do
      printf "Started processing %s\n" ${GENES[$index]}
      OUTPUT="../GVCF/$TYPE.${GENES[$index]}.gvcf"

      # Iterate through every bam file in directory
      # look-up position and return VCF
      bcftools mpileup -f ~/resources/hgr1/hgr1.fa \
        --max-depth $DEPTH -A --min-BQ 30 \
        -a FORMAT/DP,AD \
        -r ${REGIONS[$index]} \
        --ignore-RG \
        -b $BAMLIST | \
        bcftools annotate -x INFO,FORMAT/PL - | \
        bcftools view -O v - \
        >> $OUTPUT

      RESULTS+=("$OUTPUT")
      printf "Done with %s \n" ${GENES[$index]}
      printf "%s\n" ${REGIONS[$index]}

    done

    rm bam.list.tmp

    cd .. # move to tcga folder to reset
done

# Copy GVCF output to AWS S3
cd GVCF
aws s3 cp --recursive ./ $S3DIR

# Shutdown and Terminate instance
EC2ID=$(ec2metadata --instance-id)
sleep 20s # to catch errors

if [ "$TERMINATE" = TRUE ]
then
  echo "Run Complete -- Shutting down instance."
  aws ec2 terminate-instances --instance-ids $EC2ID
else
  echo "Run Complete -- Instance is online."
fi


The final two GVCF calculations finished on May 29, 2019.

#### Download Full GVCF Dataset

In [4]:
## Download Full Dataset
# mkdir ~/tmp; cd ~/tmp;
# aws s3 cp --recursive s3://crownproject/tcga2/gvcf_190520/ ./
# tar -cf gvcf_190520.tar $(ls)
# gzip gvcf_190520.tar
# aws s3 cp gvcf_190520.tar.gz s3://crownproject/tcga2/gvcf_190520/

## Local:
cd  ~/Crown/data2/tcga2_gvcf/

aws s3 ls s3://crownproject/tcga2/gvcf_190520/
aws s3 cp s3://crownproject/tcga2/gvcf_190520/gvcf_190520.tar.gz ./

2019-05-20 12:48:56    2440572 TCGA-ACC.18S.gvcf
2019-05-20 12:48:56      68401 TCGA-ACC.18SE.gvcf
2019-05-20 12:48:56    5969434 TCGA-ACC.28S.gvcf
2019-05-20 12:48:56     174180 TCGA-ACC.5.8S.gvcf
2019-05-20 12:48:56     112370 TCGA-ACC.5S.gvcf
2019-05-20 12:48:56       2054 TCGA-ACC.bamlist
2019-05-22 05:54:10   13257752 TCGA-BLCA.18S.gvcf
2019-05-22 05:54:10     450299 TCGA-BLCA.18SE.gvcf
2019-05-22 05:54:10   32953647 TCGA-BLCA.28S.gvcf
2019-05-22 05:54:10    1025336 TCGA-BLCA.5.8S.gvcf
2019-05-22 05:54:11     673323 TCGA-BLCA.5S.gvcf
2019-05-22 05:54:11      22424 TCGA-BLCA.bamlist
2019-05-22 06:08:17      62620 TCGA-BRCA.bamlist
2019-05-22 06:08:17   18830965 TCGA-BRCA1.18S.gvcf
2019-05-22 06:08:17     595068 TCGA-BRCA1.18SE.gvcf
2019-05-22 06:08:17   45076513 TCGA-BRCA1.28S.gvcf
2019-05-22 06:08:17    1407393 TCGA-BRCA1.5.8S.gvcf
2019-05-22 06:08:17     970610 TCGA-BRCA1.5S.gvcf
2019-05-22 06:08:17   19030576 TCGA-BRCA2.18S.gvcf
2019-05-22 06:08:17     570445 
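
The separate `tar` and `gzip` steps above can be combined, and the archive checked before upload. A minimal sketch, assuming the archive is written outside the directory being bundled: create the `.tar.gz` in one step and confirm it lists as many files as the source directory holds. `bundle_dir` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: bundle a directory of gVCF files and confirm the archive lists them all.
# bundle_dir is a hypothetical helper; $2 must point outside the bundled directory.
bundle_dir() {
  local dir="$1" out="$2" n_files n_members
  tar -czf "$out" -C "$dir" .                            # tar and gzip in one step
  n_files=$(find "$dir" -type f | wc -l)                 # files on disk
  n_members=$(tar -tzf "$out" | grep -vc '/$' || true)   # archive members, directories excluded
  [ "$n_files" -eq "$n_members" ] && echo "OK $n_members files in $out"
}
```

For example, `bundle_dir ~/tmp ~/gvcf_190520.tar.gz` would bundle the downloaded directory and print a one-line confirmation, returning non-zero if the counts differ.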

## Discussion

Done QED