# Colorectal Carcinoma Cell Lines
```
pi:ababaian
files: ~/Crown/data/crc_lines
start: 2017 05 12
complete : 2017 05 19
```
## Introduction

In the lab we have CRC cell lines. I have the notion that 1248U may be differentially modified in CRC, so the plan is to screen some publicly available RNA-seq for variants in rRNA, which can then be rapidly confirmed with the cell lines in the freezer.

[Fasterius](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171435) published HCT116, RKO and HKE3 RNA-seq [+GEO](https://www.ncbi.nlm.nih.gov/sra/SRX1745244[accn]) and has references for a few more libraries.


There is an even more comprehensive cell line sequencing project [here](http://www.nature.com/nbt/journal/v33/n3/full/nbt.3080.html), but the data is locked up at EBI.
### Hypothesis / Objective
- Screen for rare expressed variants / hypo-mods in rRNA of CRC cell lines

Break up the analysis into two arms: (1) HCT116, RKO and HKE3, and (2) the remaining lines.


## Material and Methods

Initialize Workspace

In [1]:
CROWN='/home/artem/Crown'
DATA_HOME='/home/artem/Crown/data/'



In [2]:
cd $DATA_HOME
mkdir -p crc_lines
cd crc_lines



In [3]:
# Acquire Data from SRA
# HCT116
prefetch SRR3479755
prefetch SRR3479757

# RKO
prefetch SRR3479758
prefetch SRR3479760

# HKE3
prefetch SRR3479763
# prefetch SRR3479764


2017-05-13T16:58:07 prefetch.2.8.2: 1) Downloading 'SRR3479764'...
2017-05-13T16:58:07 prefetch.2.8.2:  Downloading via http...
2017-05-13T17:22:18 prefetch.2.8.2 sys: transfer interrupted while reading file within network system module - mbedtls_ssl_read returned -26880 ( SSL - Connection requires a read call )
2017-05-13T17:22:18 prefetch.2.8.2 int: transfer interrupted while reading file within network system module - x
2017-05-13T17:22:18 prefetch.2.8.2: 1) failed to download SRR3479764


In [None]:
prefetch SRR3479764 
# ran in a separate script window
# failed 4x, continue without for now
# download another replicate
# also tried prefetch SRR3479762
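
Since the `prefetch` download kept failing on a dropped connection, one option for future downloads is a small retry wrapper. This is a minimal sketch only: `retry` is a hypothetical helper (not part of the SRA toolkit), the pause length is arbitrary, and it was not used for this analysis.

```bash
#!/bin/bash
# retry: hypothetical helper, run a command up to N times until it succeeds.
# Usage: retry <max_attempts> <pause_seconds> <command> [args...]
retry() {
  local max="$1" pause="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: gave up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed, retrying: $*" >&2
    attempt=$((attempt + 1))
    sleep "$pause"
  done
}

# Example (not run here): retry 5 60 prefetch SRR3479764
```

The wrapper returns non-zero once the attempts are exhausted, so a calling script can decide whether to continue without that library.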

In [4]:
# Dump to FASTQ format in crc_lines folder

# HCT116
fastq-dump --gzip --split-files SRR3479755


2017-05-13T19:18:24 fastq-dump.2.8.2 err: process canceled while executing process - failed SRR3479757


In [1]:
fastq-dump --gzip --split-files SRR3479757

#RKO
fastq-dump --gzip --split-files SRR3479758
fastq-dump --gzip --split-files SRR3479760

Read 37049468 spots for SRR3479757
Written 37049468 spots for SRR3479757
Read 35763277 spots for SRR3479758
Written 35763277 spots for SRR3479758
Read 33883402 spots for SRR3479760
Written 33883402 spots for SRR3479760


In [3]:
#HKE3
fastq-dump --gzip --split-files SRR3479763
#fastq-dump --gzip --split-files SRR3479764

Read 67249613 spots for SRR3479763
Written 67249613 spots for SRR3479763


In [5]:
cat crc_lines_data.txt
# Note: SRR3479764 was removed
# from initial analysis. Repeated download
# errors.

hct116_1	HCT116	SRX1745244	inVitro	SRR3479755_1.fastq.gz	SRR3479755_2.fastq.gz
hct116_2	HCT116	SRX1745244	inVitro	SRR3479757_1.fastq.gz	SRR3479757_2.fastq.gz
rko_1	RKO	SRX1745244	inVitro	SRR3479758_1.fastq.gz	SRR3479758_2.fastq.gz
rko_2	RKO	SRX1745244	inVitro	SRR3479760_1.fastq.gz	SRR3479760_2.fastq.gz
hke3_1	HKE3	SRX1745244	inVitro	SRR3479763_1.fastq.gz	SRR3479763_2.fastq.gz
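
The alignment loop reads this list line by line and expects six tab-separated fields per library (library name, sample, accession, population, read 1 file, read 2 file). A quick sanity check before a long run could look like the sketch below; `check_manifest` is a hypothetical helper, and skipping `#` comment lines and blank lines is an assumption about how the list should be treated.

```bash
#!/bin/bash
# check_manifest: hypothetical helper that reads a tab-separated library list
# on stdin and reports lines that do not have exactly six fields.
# Comment lines (starting with '#') and blank lines are skipped.
check_manifest() {
  local bad=0 lineno=0 line
  local -a fields
  while IFS= read -r line || [ -n "$line" ]; do
    lineno=$((lineno + 1))
    case "$line" in
      ''|'#'*) continue ;;
    esac
    IFS=$'\t' read -r -a fields <<< "$line"
    if [ "${#fields[@]}" -ne 6 ]; then
      echo "line $lineno: expected 6 fields, found ${#fields[@]}" >&2
      bad=$((bad + 1))
    else
      # Print the library name and its two FASTQ files
      echo "${fields[0]} -> ${fields[4]} ${fields[5]}"
    fi
  done
  [ "$bad" -eq 0 ]
}

# Example: check_manifest < crc_lines_data.txt
```

The function exits non-zero if any line is malformed, so it can gate the start of the alignment script.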


In [6]:
#!/bin/bash
# crc_lines_align_hgr1.fa
# rDNA alignment pipeline
# for crc cell line data on local machine

# Control Panel -------------------------------

# Project Dir
  BASE='/home/artem/Crown/data/crc_lines/'
  cd $BASE

# Sequencing Data
  CRC_DIR='/home/artem/Crown/data/crc_lines/'
  LIB_LIST='crc_lines_data.txt' # list of fastq files in hgr1 format
  
# CPU
  THREADS='1'
  
# Initialize start-up sequence ----------------
# Make working directory
  mkdir -p align
  mkdir -p flagstat

#Resources
  cp $CROWN/resources/hgr1/hgr1.fa ./
  samtools faidx hgr1.fa
  bowtie2-build hgr1.fa hgr1


Settings:
  Output files: "hgr1.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  hgr1.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 4072
Using parameters --bmax 3054 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 3054 --dcv 1024
Constructing suff

In [3]:
# script below originally interrupted
# completed 3/5 alignments. restarting at 4/5
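
When a run is interrupted part-way, one way to resume without editing the library list is to skip libraries whose final BAM already exists. A minimal sketch, assuming the `align/<library>.hgr1.bam` naming used by the loop below; `needs_alignment` is a hypothetical helper and was not part of the original run.

```bash
#!/bin/bash
# needs_alignment: hypothetical helper, succeeds (exit 0) when the final BAM
# for a library is missing or empty, i.e. the library still has to be aligned.
# Usage: needs_alignment <library_name> [align_dir]
needs_alignment() {
  local library="$1" align_dir="${2:-align}"
  [ ! -s "$align_dir/$library.hgr1.bam" ]
}

# Inside the per-library loop one could then write:
#   if ! needs_alignment "$LIBRARY"; then
#     echo "Skipping $LIBRARY (already aligned)"
#     continue
#   fi
```

Because the loop only moves the BAM into `align/` after sorting and indexing, the presence of a non-empty file there is a reasonable proxy for a finished library.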




In [4]:
# ---------------------------------------------
# SCRIPT LOOP ---------------------------------
# ---------------------------------------------
# For each line in input LIB_LIST; run the pipeline


cat $LIB_LIST | while read LINE
do
    #Initialize Run
    echo "Start Iteration:"
    echo "  $LINE"
    echo ''
    
    LIBRARY=$(echo $LINE | cut -f1 -d' ' -) # Library Name
    RGSM=$(echo $LINE | cut -f2 -d' ' -)    # Sample / Patient Identifier
    RGID=$(echo $LINE | cut -f3 -d' ' -)    # Read Group ID
    RGLB=$(echo $LINE | cut -f3 -d' ' -)    # Library Name. Accession Number
    RGPL='ILLUMINA'                   # Sequencing Platform.
    RGPO=$(echo $LINE | cut -f4 -d' ' -)    # Patient Population

    FASTQ1=$(echo $LINE | cut -f5 -d' ' -)  # Filename Read 1
    FASTQ2=$(echo $LINE | cut -f6 -d' ' -)  # Filename Read 2  

    FQ1="$CRC_DIR/$FASTQ1"            # Fastq1 Filepath
    FQ2="$CRC_DIR/$FASTQ2"            # Fastq2 Filepath

    echo "Lib: $LIBRARY"
    echo "RGSM: $RGSM"
    echo "RGID: $RGID"
    echo "RGLB: $RGLB"
    echo "RGPL: $RGPL"
    echo "RGPO: $RGPO"
    echo "FQ: $FQ1 $FQ2"
    echo ''
    echo ''  

    # Extract Sequencing Run Info
    RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d'.' | cut -f2 -d'@')

    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID \
      --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -1 $FQ1 -2 $FQ2 |\
      samtools view -bS - > aligned_unsorted.bam

    # Calculate library flagstats
    samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat



    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Unmapped reads with mapped pairs
      # Extract Mapped Reads
      # and their unmapped pairs
      samtools view -b -F 4 aligned_unsorted.bam > align.F4.bam #mapped
      samtools view -b -f 4 -F 8 aligned_unsorted.bam > align.f4F8.bam #unmapped pairs

      # What are the mapped readnames
      samtools view align.F4.bam | cut -f1 - > read.names.tmp

      # Extract mapped reads
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam


      # Extract cases of read pairs mapped on edge of region of interest
      # -------|======= R O I ======| ----------
      # read:                  ====---====
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

      # Complete mapped reads list
      #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

      # Extract unmapped reads with a mapped pair
      samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

      # Re-compile bam file
      cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr1.tmp.bam
        samtools sort align.hgr1.tmp.bam -o align.hgr1.bam


    # mv align.hgr1 align.hgr1.bam
        samtools index align.hgr1.bam
        samtools flagstat align.hgr1.bam > align.hgr1.flagstat

      # Clean up 
      rm *tmp* align.F4.bam align.f4F8.bam

    # Rename the hgr Bam files
      mv align.hgr1.bam align/$LIBRARY.hgr1.bam
      mv align.hgr1.bam.bai align/$LIBRARY.hgr1.bam.bai
      mv align.hgr1.flagstat flagstat/$LIBRARY.hgr1.flagstat

    # Rename the total Bam Files
      rm aligned_unsorted.bam
      mv aligned_unsorted.flagstat flagstat/$LIBRARY.flagstat

done

# Script complete

Start Iteration:
  rko_2	RKO	SRX1745244	inVitro	SRR3479760_1.fastq.gz	SRR3479760_2.fastq.gz

Lib: rko_2
RGSM: RKO
RGID: SRX1745244
RGLB: SRX1745244
RGPL: ILLUMINA
RGPO: inVitro
FQ: /home/artem/Crown/data/crc_lines//SRR3479760_1.fastq.gz /home/artem/Crown/data/crc_lines//SRR3479760_2.fastq.gz



gzip: stdout: Broken pipe
33883402 reads; of these:
  33883402 (100.00%) were paired; of these:
    33610018 (99.19%) aligned concordantly 0 times
    273310 (0.81%) aligned concordantly exactly 1 time
    74 (0.00%) aligned concordantly >1 times
    ----
    33610018 pairs aligned concordantly 0 times; of these:
      13862 (0.04%) aligned discordantly 1 time
    ----
    33596156 pairs aligned 0 times concordantly or discordantly; of these:
      67192312 mates make up the pairs; of these:
        67180873 (99.98%) aligned 0 times
        11408 (0.02%) aligned exactly 1 time
        31 (0.00%) aligned >1 times
0.86% overall alignment rate
Start Iteration:
  hke3_1
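
As a sanity check, the 0.86% overall alignment rate reported for rko_2 can be reproduced from the counts in the bowtie2 summary above: every pair that aligned concordantly or discordantly contributes two mates, and mates from the remaining pairs that aligned on their own contribute one each. A small sketch of that arithmetic (the function name `overall_rate` is made up here):

```bash
#!/bin/bash
# overall_rate: recompute bowtie2's overall alignment rate for paired-end data.
# Args: total_pairs concordant_once concordant_multi discordant single_once single_multi
overall_rate() {
  awk -v n="$1" -v c1="$2" -v cm="$3" -v d="$4" -v s1="$5" -v sm="$6" 'BEGIN {
    aligned_mates = 2 * (c1 + cm + d) + s1 + sm   # pairs count twice, lone mates once
    printf "%.2f\n", 100 * aligned_mates / (2 * n)
  }'
}

# rko_2 counts copied from the summary above
overall_rate 33883402 273310 74 13862 11408 31   # prints 0.86
```

So under 1% of mates map to hgr1, which is the fraction of the library carried forward into `align/`.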

### Inspection of reads in IGV for macp

#### HCT116

```
HCT116 - 1
chr13:1,004,908
Total count: 2596
A      : 100  (4%,     53+,   47- )
C      : 1185  (46%,     572+,   613- )
G      : 110  (4%,     50+,   60- )
T      : 1201  (46%,     516+,   685- )
N      : 0
---------------
DEL: 101
INS: 0

HCT116 - 2
chr13:1,004,908
Total count: 4716
A      : 145  (3%,     69+,   76- )
C      : 1885  (40%,     862+,   1023- )
G      : 191  (4%,     112+,   79- )
T      : 2495  (53%,     1164+,   1331- )
N      : 0
---------------
DEL: 214
INS: 0
```

#### HKE3

```
HKE3 - 1
chr13:1,004,908
Total count: 4312
A      : 163  (4%,     99+,   64- )
C      : 1696  (39%,     1038+,   658- )
G      : 132  (3%,     86+,   46- )
T      : 2321  (54%,     1467+,   854- )
N      : 0
---------------
DEL: 222
INS: 0
```

#### RKO

```
RKO - 1
chr13:1,004,908
Total count: 675
A      : 21  (3%,     11+,   10- )
C      : 206  (31%,     117+,   89- )
G      : 18  (3%,     14+,   4- )
T      : 430  (64%,     229+,   201- )
N      : 0
---------------
DEL: 25
INS: 0

RKO - 2
chr13:1,004,908
Total count: 1537
A      : 29  (2%,     12+,   17- )
C      : 501  (33%,     250+,   251- )
G      : 33  (2%,     15+,   18- )
T      : 974  (63%,     435+,   539- )
N      : 0
---------------
DEL: 63
INS: 0
```
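
The percentages IGV prints are each base count divided by the total count (A+C+G+T+N, with deletions reported separately), and the RAF values tabulated in the Results appear to be the T percentage at this position. A short sketch to recompute them to one decimal place from the pasted counts; `raf` is a hypothetical helper name.

```bash
#!/bin/bash
# raf: hypothetical helper, percentage of reads carrying the reference base.
# Usage: raf <reference_base_count> <total_count>
raf() {
  awk -v ref="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "NA"; exit }
    printf "%.1f\n", 100 * ref / total
  }'
}

# T counts and total counts copied from the IGV pop-ups above
raf 1201 2596   # HCT116 - 1
raf 2495 4716   # HCT116 - 2
raf 2321 4312   # HKE3 - 1
raf 430 675     # RKO - 1
raf 974 1537    # RKO - 2
```

Working from the raw counts rather than the rounded IGV percentages avoids compounding rounding error in downstream statistics.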



In [5]:
# Acquire Data set II
# HCT116
prefetch SRR3479756

# RKO
prefetch SRR3479759

fastq-dump --gzip --split-files SRR3479756
fastq-dump --gzip --split-files SRR3479759


2017-05-17T02:33:58 prefetch.2.8.2: 1) Downloading 'SRR3479756'...
2017-05-17T02:33:58 prefetch.2.8.2:  Downloading via http...
2017-05-17T04:26:09 prefetch.2.8.2: 1) 'SRR3479756' was downloaded successfully

2017-05-17T04:26:26 prefetch.2.8.2: 1) Downloading 'SRR3479759'...
2017-05-17T04:26:26 prefetch.2.8.2:  Downloading via http...
2017-05-17T06:45:23 prefetch.2.8.2: 1) 'SRR3479759' was downloaded successfully
Read 29388187 spots for SRR3479756
Written 29388187 spots for SRR3479756
Read 33832789 spots for SRR3479759
Written 33832789 spots for SRR3479759


In [5]:
# Project Dir
  BASE='/home/artem/Crown/data/crc_lines/'
  cd $BASE

# Sequencing Data
  CRC_DIR='/home/artem/Crown/data/crc_lines/'
  LIB_LIST='crc_lines_data_2.txt' # list of fastq files in hgr1 format
  cat $LIB_LIST
  
  # HCT116 - 3 ran and was made, jupyter crashed afterwards
  # use a bash terminal in the future, jupyter is not as stable

  LIB_LIST='crc_lines_data_3.txt' # list of fastq files in hgr1 format

# CPU
  THREADS='1'

cat $LIB_LIST

# ---------------------------------------------
# SCRIPT LOOP ---------------------------------
# ---------------------------------------------
# For each line in input LIB_LIST; run the pipeline


cat $LIB_LIST | while read LINE
do
    #Initialize Run
    echo "Start Iteration:"
    echo "  $LINE"
    echo ''
    
    LIBRARY=$(echo $LINE | cut -f1 -d' ' -) # Library Name
    RGSM=$(echo $LINE | cut -f2 -d' ' -)    # Sample / Patient Identifier
    RGID=$(echo $LINE | cut -f3 -d' ' -)    # Read Group ID
    RGLB=$(echo $LINE | cut -f3 -d' ' -)    # Library Name. Accession Number
    RGPL='ILLUMINA'                   # Sequencing Platform.
    RGPO=$(echo $LINE | cut -f4 -d' ' -)    # Patient Population

    FASTQ1=$(echo $LINE | cut -f5 -d' ' -)  # Filename Read 1
    FASTQ2=$(echo $LINE | cut -f6 -d' ' -)  # Filename Read 2  

    FQ1="$CRC_DIR/$FASTQ1"            # Fastq1 Filepath
    FQ2="$CRC_DIR/$FASTQ2"            # Fastq2 Filepath

    echo "Lib: $LIBRARY"
    echo "RGSM: $RGSM"
    echo "RGID: $RGID"
    echo "RGLB: $RGLB"
    echo "RGPL: $RGPL"
    echo "RGPO: $RGPO"
    echo "FQ: $FQ1 $FQ2"
    echo ''
    echo ''  

    # Extract Sequencing Run Info
    RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d'.' | cut -f2 -d'@')

    # Bowtie2: align to genome
    bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID \
      --rg LB:$RGLB --rg SM:$RGSM \
      --rg PL:$RGPL --rg PU:$RGPU \
      -x hgr1 -1 $FQ1 -2 $FQ2 |\
      samtools view -bS - > aligned_unsorted.bam

    # Calculate library flagstats
    samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat



    # Read Subset ------------------------------
    # Extract mapped reads, and their unmapped pairs

      # Extract Header
      samtools view -H aligned_unsorted.bam > align.header.tmp

      # Unmapped reads with mapped pairs
      # Extract Mapped Reads
      # and their unmapped pairs
      samtools view -b -F 4 aligned_unsorted.bam > align.F4.bam #mapped
      samtools view -b -f 4 -F 8 aligned_unsorted.bam > align.f4F8.bam #unmapped pairs

      # What are the mapped readnames
      samtools view align.F4.bam | cut -f1 - > read.names.tmp

      # Extract mapped reads
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam


      # Extract cases of read pairs mapped on edge of region of interest
      # -------|======= R O I ======| ----------
      # read:                  ====---====
      samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

      # Complete mapped reads list
      #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

      # Extract unmapped reads with a mapped pair
      samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

      # Re-compile bam file
      cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr1.tmp.bam
        samtools sort align.hgr1.tmp.bam -o align.hgr1.bam


    # mv align.hgr1 align.hgr1.bam
        samtools index align.hgr1.bam
        samtools flagstat align.hgr1.bam > align.hgr1.flagstat

      # Clean up 
      rm *tmp* align.F4.bam align.f4F8.bam

    # Rename the hgr Bam files
      mv align.hgr1.bam align/$LIBRARY.hgr1.bam
      mv align.hgr1.bam.bai align/$LIBRARY.hgr1.bam.bai
      mv align.hgr1.flagstat flagstat/$LIBRARY.hgr1.flagstat

    # Rename the total Bam Files
      rm aligned_unsorted.bam
      mv aligned_unsorted.flagstat flagstat/$LIBRARY.flagstat

done

# Script complete

hct116_3	HCT116	SRX1745244	inVitro	SRR3479756_1.fastq.gz	SRR3479756_2.fastq.gz
rko_3	RKO	SRX1745244	inVitro	SRR3479759_1.fastq.gz	SRR3479759_2.fastq.gz
Start Iteration:
  rko_3	RKO	SRX1745244	inVitro	SRR3479759_1.fastq.gz	SRR3479759_2.fastq.gz

Lib: rko_3
RGSM: RKO
RGID: SRX1745244
RGLB: SRX1745244
RGPL: ILLUMINA
RGPO: inVitro
FQ: /home/artem/Crown/data/crc_lines//SRR3479759_1.fastq.gz /home/artem/Crown/data/crc_lines//SRR3479759_2.fastq.gz



gzip: stdout: Broken pipe
33832789 reads; of these:
  33832789 (100.00%) were paired; of these:
    33557460 (99.19%) aligned concordantly 0 times
    275273 (0.81%) aligned concordantly exactly 1 time
    56 (0.00%) aligned concordantly >1 times
    ----
    33557460 pairs aligned concordantly 0 times; of these:
      15811 (0.05%) aligned discordantly 1 time
    ----
    33541649 pairs aligned 0 times concordantly or discordantly; of these:
      67083298 mates make up the pairs; of these:
        67072724 (99.98%) al
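
The read-subset step in the loop above keeps two classes of reads from the paired-end alignment: `-F 4` selects reads whose "unmapped" bit (0x4) is clear, and `-f 4 -F 8` selects unmapped reads whose mate is mapped (bit 0x4 set, bit 0x8 clear). A small pure-bash sketch of that bit logic, using a hypothetical `classify_flag` helper, for checking individual SAM FLAG values by hand:

```bash
#!/bin/bash
# classify_flag: hypothetical helper mirroring the two samtools filters used
# in the read-subset step. Takes a decimal SAM FLAG value.
classify_flag() {
  local flag="$1"
  if (( (flag & 4) == 0 )); then
    echo "mapped"                 # kept by: samtools view -F 4
  elif (( (flag & 8) == 0 )); then
    echo "unmapped_mate_mapped"   # kept by: samtools view -f 4 -F 8
  else
    echo "dropped"                # read and mate both unmapped
  fi
}

classify_flag 99    # paired, proper pair, mate reverse, first in pair
classify_flag 133   # paired, read unmapped, second in pair
classify_flag 77    # paired, read and mate both unmapped, first in pair
```

Only pairs in which neither mate aligned fall into the last class, which is why the subset BAM is so much smaller than the full alignment.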

In [3]:
# Clean-up directory

mkdir fq

mv *.fastq.gz fq/
mv *.txt fq/

rm hgr1*



## Results / Discussion

### RAF at 1248U


| Cell Line | rep 1 (%) | rep 2 (%) | rep 3 (%) |
|-----------|-----------|-----------|-----------|
| HCT116    |   46  |  53   |   51  |
| RKO       |   64  |  63   |   56  |
| HKE3      |   54  |       |       |


RKO is a good candidate for a hypo-modification phenotype (Student's t-test between HCT116 and RKO, P = 0.028). All samples are essentially negative for the 18S-E precursor, so this is not an apparent biogenesis phenotype. Confirming this in the lab is the next step.
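
The P-value can be reproduced with a two-sample Student's t-test (pooled variance, two-sided) on the HCT116 and RKO rows of the table. The sketch below is in awk and assumes exactly three replicates per group, so df = 4, for which the t-distribution CDF has a closed form; `ttest3` is a hypothetical name and this is not necessarily how the original value was computed.

```bash
#!/bin/bash
# ttest3: hypothetical helper, pooled-variance two-sided t-test for two groups
# of three values each (df = 4). Prints "t=<value> df=4 p=<value>".
# Usage: ttest3 a1 a2 a3 b1 b2 b3
ttest3() {
  awk -v a1="$1" -v a2="$2" -v a3="$3" -v b1="$4" -v b2="$5" -v b3="$6" 'BEGIN {
    ma = (a1 + a2 + a3) / 3
    mb = (b1 + b2 + b3) / 3
    ssa = (a1 - ma)^2 + (a2 - ma)^2 + (a3 - ma)^2
    ssb = (b1 - mb)^2 + (b2 - mb)^2 + (b3 - mb)^2
    sp2 = (ssa + ssb) / 4                  # pooled variance, df = 3 + 3 - 2
    t = (ma - mb) / sqrt(sp2 * (2 / 3))    # standard error uses 1/3 + 1/3
    if (t < 0) t = -t
    # Closed-form CDF of the t-distribution for df = 4
    u = 1 + t * t / 4
    cdf = 0.5 + (3 / 8) * (t / sqrt(u)) * (1 - (t * t / u) / 12)
    printf "t=%.2f df=4 p=%.3f\n", t, 2 * (1 - cdf)
  }'
}

# HCT116 (46, 53, 51) vs RKO (64, 63, 56) from the table above
ttest3 46 53 51 64 63 56   # prints t=3.37 df=4 p=0.028
```

The group means are 50 and 61 with a pooled variance of 16, giving |t| of about 3.37 on 4 degrees of freedom.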


### 18S.990A > G variant

While inspecting the RNA-seq libraries in IGV, there was a notable difference between the HCT116 and RKO libraries at `chr13:1004650`: HCT116 carries a ~10% expressed variant, 18S.990A>G, at a conserved position that is also a protein contact.
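
The two coordinates quoted in these notes (18S.1248 at chr13:1,004,908 and 18S.990 at chr13:1004650) imply a constant offset of 1,003,660 between 18S position and hgr1 coordinate. A tiny sketch of the conversion, assuming that offset holds across the whole 18S (it is inferred from those two pairs only); the variable and function names are made up.

```bash
#!/bin/bash
# Offset inferred from the notes: 1004908 - 1248 = 1004650 - 990 = 1003660.
# Assumption: 18S maps onto hgr1 chr13 without gaps, so the offset is constant.
OFFSET_18S=1003660

# 18S position -> hgr1 chr13 coordinate
pos18s_to_hgr1() { echo $(( $1 + OFFSET_18S )); }

# hgr1 chr13 coordinate -> 18S position
hgr1_to_pos18s() { echo $(( $1 - OFFSET_18S )); }

pos18s_to_hgr1 990       # prints 1004650, the A>G variant
hgr1_to_pos18s 1004908   # prints 1248, the position inspected in IGV
```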

![IGV view of 990A](../../data/crc_lines/plot/170519_18S.A990G_hct116_variant.png)

This position is highly (although not perfectly) conserved in eukaryotes but not in prokaryotes/archaea. Sequence context is:

`gacggacc A gagcgaaag`

![Conservation](../../data/crc_lines/plot/170519_18S.A990G.eukAlgn.png)

It's involved in secondary structure (helix 21a) with base-pairing to A1001 within 18S.

![18S helix 21a](../../data/crc_lines/plot/170519_18S.A990_2ndStruc.png)

From the 4UG0 structure, it looks like a change to G at this position would disrupt the base-pairing interaction, although it would take simulation/structural analysis to see whether there's another local minimum reachable by shifting things around.

![helix 21 structure](../../data/crc_lines/plot/170519_990A.hct116.png)

---

This variant is rare in the DNA-seq dataset, being called in only 2/107 genomes. So this is clearly a variant haplotype and likely has biochemical consequences. The protein chain in the above image is S3a, which wouldn't really be disrupted in the interaction, but I'm thinking that the entire helix 21 may fold differently.


In [4]:
# For future reference the homologous positions of A990
cat plot/170519_18S.990A.txt

>Homo_sapiens_18S.990A
>Mus_Musculus_18S.991A
>C_elegans_18S.907A
>D_melanogaster_18S.U1019
>A_thaliana_18S.937A
>S_cerevisae_18S.933A
>E_coli_18S.722G


QED