# MCF7 initial analysis
```
pi:ababaian
files: ~/Crown/data/mcf7
start: 2017 05 11
complete : 2017 05 12
```
## Introduction

On the Protein Atlas entry for [TSR3](http://www.proteinatlas.org/ENSG00000007520-TSR3/cell#human), the MCF7 breast cancer cell line is negative for TSR3. This could be a staining issue, but if the enzyme is depleted here then I would expect a hypo-modification phenotype similar to CRC.

![MCF7 Protein](../../data/mcf7/plot/20170511_protein_atlas.png)

### Hypothesis

- The TSR3-depleted (assumed) breast cancer cell line MCF7 will be hypo-modified at 18S 1248.U > macP


## Materials and Methods

- SRR2532362: [MCF7 RNAseq](https://www.ncbi.nlm.nih.gov/pubmed/26771497). Paired-end 50 bp reads

In [6]:
#Initialize project
CROWN='/home/artem/Crown'
cd $CROWN



In [7]:
mkdir -p $CROWN/data/mcf7
mkdir -p $CROWN/data/mcf7/plot



In [8]:
cd $CROWN/data/mcf7



In [11]:
prefetch SRR2532362


2017-05-12T01:44:56 prefetch.2.8.2: 1) Downloading 'SRR2532362'...
2017-05-12T01:44:56 prefetch.2.8.2:  Downloading via http...
2017-05-12T03:01:04 prefetch.2.8.2: 1) 'SRR2532362' was downloaded successfully


In [14]:
fastq-dump --gzip --split-files SRR2532362

Read 47425419 spots for SRR2532362
Written 47425419 spots for SRR2532362
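
Before aligning, it is worth confirming that both mate files hold the same number of reads. A minimal sketch, assuming each FASTQ record occupies exactly four lines; the function name `count_reads` and the tiny demo files are hypothetical and stand in for the real `SRR2532362_1.fastq.gz` / `SRR2532362_2.fastq.gz`:

```bash
#!/bin/bash
# Count reads in a gzipped FASTQ file, assuming 4 lines per record.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Demo on two tiny made-up mate files (not the real SRR2532362 data).
DEMO_DIR=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$DEMO_DIR/demo_1.fastq.gz"
printf '@r1\nTGCA\n+\nIIII\n@r2\nCCAT\n+\nIIII\n' | gzip -c > "$DEMO_DIR/demo_2.fastq.gz"

N1=$(count_reads "$DEMO_DIR/demo_1.fastq.gz")
N2=$(count_reads "$DEMO_DIR/demo_2.fastq.gz")
if [ "$N1" -eq "$N2" ]; then
    echo "OK: $N1 read pairs"
else
    echo "Mismatch: $N1 vs $N2" >&2
fi
rm -r "$DEMO_DIR"
```

On the real data, both counts would be expected to match the spot count reported by `fastq-dump`, provided every spot carries both mates.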


In [23]:
#!/bin/bash
# mcf7_align_hgr1.sh
# rDNA alignment pipeline
# for MCF7 data on local machine

# Control Panel -------------------------------

# Project Dir
  BASE='/home/artem/Crown/data/mcf7/'
  cd $BASE

# Sequencing Data
  CRC_DIR='/home/artem/Crown/data/mcf7/'
  LIB_LIST='mcf7_data.txt' # list of fastq files in hgr1 format
  
# CPU
  THREADS='1'
  
# Initialize start-up sequence ----------------
# Make working directory
  mkdir -p align

#Resources
  cp $CROWN/resources/hgr1/hgr1.fa ./
  samtools faidx hgr1.fa
  bowtie2-build hgr1.fa hgr1



Settings:
  Output files: "hgr1.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  hgr1.fa
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 4072
Using parameters --bmax 3054 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 3054 --dcv 1024
Constructing suff

In [25]:
# ---------------------------------------------
# SCRIPT LOOP ---------------------------------
# ---------------------------------------------
# For each line in input LIB_LIST; run the pipeline

LINE=$(cat $LIB_LIST)
echo $LINE

    #Initialize Run
    echo "Start Iteration:"
    echo "  $LINE"
    echo ''
    
    LIBRARY=$(echo $LINE | cut -f1 -d' ' -) # Library Name
    RGSM=$(echo $LINE | cut -f2 -d' ' -)    # Sample / Patient Identifier
    RGID=$(echo $LINE | cut -f3 -d' ' -)    # Read Group ID
    RGLB=$(echo $LINE | cut -f3 -d' ' -)    # Library Name. Accession Number
    RGPL='ILLUMINA'                   # Sequencing Platform.
    RGPO=$(echo $LINE | cut -f4 -d' ' -)    # Patient Population

    FASTQ1=$(echo $LINE | cut -f5 -d' ' -)  # Filename Read 1
    FASTQ2=$(echo $LINE | cut -f6 -d' ' -)  # Filename Read 2  

    FQ1="$CRC_DIR/$FASTQ1"            # Fastq1 Filepath
    FQ2="$CRC_DIR/$FASTQ2"            # Fastq2 Filepath

    echo "Lib: $LIBRARY"
    echo "RGSM: $RGSM"
    echo "RGID: $RGID"
    echo "RGLB: $RGLB"
    echo "RGPL: $RGPL"
    echo "RGPO: $RGPO"
    echo "FQ: $FQ1 $FQ2"
    echo ''
    echo ''  

mcf7 MCF7 SRX1293335 inVitro SRR2532362_1.fastq.gz SRR2532362_2.fastq.gz
Start Iteration:
  mcf7	MCF7	SRX1293335	inVitro	SRR2532362_1.fastq.gz	SRR2532362_2.fastq.gz

Lib: mcf7
RGSM: MCF7
RGID: SRX1293335
RGLB: SRX1293335
RGPL: ILLUMINA
RGPO: inVitro
FQ: /home/artem/Crown/data/mcf7//SRR2532362_1.fastq.gz /home/artem/Crown/data/mcf7//SRR2532362_2.fastq.gz
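
The cell above relies on the unquoted `echo $LINE` collapsing the tabs in `mcf7_data.txt` into single spaces, which is what `cut -d' '` then splits on, and it only handles a one-line list. A sketch of a per-line alternative that splits on tabs directly; the helper name `parse_lib_line` is hypothetical, and the demo file just reproduces the single MCF7 entry echoed above:

```bash
#!/bin/bash
# Parse one tab-delimited library-list line into named fields.
# Field order follows the line echoed above; RGLB reuses the RGID column,
# as in the original cell.
parse_lib_line() {
    local LIBRARY RGSM RGID RGPO FASTQ1 FASTQ2
    IFS=$'\t' read -r LIBRARY RGSM RGID RGPO FASTQ1 FASTQ2 <<< "$1"
    echo "Lib: $LIBRARY"
    echo "RGSM: $RGSM"
    echo "RGID: $RGID"
    echo "RGLB: $RGID"
    echo "RGPO: $RGPO"
    echo "FQ: $FASTQ1 $FASTQ2"
}

# Demo list file holding the single MCF7 entry.
LIB_LIST=$(mktemp)
printf 'mcf7\tMCF7\tSRX1293335\tinVitro\tSRR2532362_1.fastq.gz\tSRR2532362_2.fastq.gz\n' > "$LIB_LIST"

# One iteration per line; quoting "$LINE" keeps the tabs intact.
while IFS= read -r LINE; do
    parse_lib_line "$LINE"
done < "$LIB_LIST"
rm -f "$LIB_LIST"
```

With more than one library in the list, the same loop would run the pipeline once per line without further changes.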




In [26]:
# Extract Sequencing Run Info
RGPU=$(gzip -dc $FQ1 | head -n1 - | cut -f1 -d'.' | cut -f2 -d'@')

# Bowtie2: align to genome
bowtie2 --very-sensitive-local -p $THREADS --rg-id $RGID \
  --rg LB:$RGLB --rg SM:$RGSM \
  --rg PL:$RGPL --rg PU:$RGPU \
  -x hgr1 -1 $FQ1 -2 $FQ2 |\
  samtools view -bS - > aligned_unsorted.bam

# Calculate library flagstats
samtools flagstat aligned_unsorted.bam > aligned_unsorted.flagstat


# Rename the total Bam Files
  mv aligned_unsorted.bam $LIBRARY.bam
  mv aligned_unsorted.bam.bai $LIBRARY.bam.bai   # fails: the unsorted BAM was never indexed
  mv aligned_unsorted.flagstat $LIBRARY.flagstat

# Script complete


gzip: stdout: Broken pipe
47425419 reads; of these:
  47425419 (100.00%) were paired; of these:
    46379254 (97.79%) aligned concordantly 0 times
    1046090 (2.21%) aligned concordantly exactly 1 time
    75 (0.00%) aligned concordantly >1 times
    ----
    46379254 pairs aligned concordantly 0 times; of these:
      719 (0.00%) aligned discordantly 1 time
    ----
    46378535 pairs aligned 0 times concordantly or discordantly; of these:
      92757070 mates make up the pairs; of these:
        92747510 (99.99%) aligned 0 times
        9359 (0.01%) aligned exactly 1 time
        201 (0.00%) aligned >1 times
2.22% overall alignment rate
mv: cannot stat 'aligned_unsorted.bam.bai': No such file or directory
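
As a sanity check on the bowtie2 summary, the overall alignment rate can be recomputed from the individual counts. A sketch using only the numbers printed above; the variable names are mine, not bowtie2's:

```bash
#!/bin/bash
# Recompute the overall alignment rate from the bowtie2 summary counts.
PAIRS=47425419        # total read pairs
CONC_1=1046090        # pairs aligned concordantly exactly 1 time
CONC_MULTI=75         # pairs aligned concordantly >1 times
DISC_1=719            # pairs aligned discordantly 1 time
MATE_1=9359           # single mates aligned exactly 1 time
MATE_MULTI=201        # single mates aligned >1 times

# Each aligned pair contributes two mates; unpaired hits contribute one.
ALIGNED_MATES=$(( 2 * (CONC_1 + CONC_MULTI + DISC_1) + MATE_1 + MATE_MULTI ))
TOTAL_MATES=$(( 2 * PAIRS ))

RATE=$(awk -v a="$ALIGNED_MATES" -v t="$TOTAL_MATES" 'BEGIN { printf "%.2f", 100 * a / t }')
echo "$ALIGNED_MATES of $TOTAL_MATES mates aligned (${RATE}%)"
```

This gives 2103328 of 94850838 mates, which matches the reported 2.22%. A low rate is expected here, since the reference is only the hgr1 rDNA region rather than the whole genome.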


In [29]:
# Read Subset ------------------------------
# Extract mapped reads, and their unmapped pairs

  # Extract Header
  samtools view -H mcf7.bam > align.header.tmp

  # Extract mapped reads (-F 4), plus unmapped reads
  # whose mate is mapped (-f 4 -F 8)
  samtools view -b -F 4 mcf7.bam > align.F4.bam #mapped
  samtools view -b -f 4 -F 8 mcf7.bam > align.f4F8.bam #unmapped reads with a mapped mate

  # What are the mapped readnames
  samtools view align.F4.bam | cut -f1 - > read.names.tmp

  # Extract mapped reads
  samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam


  # Extract cases of read pairs mapped on edge of region of interest
  # -------|======= R O I ======| ----------
  # read:                  ====---====
  samtools view align.F4.bam | grep -Ff read.names.tmp - > align.F4.tmp.sam

  # Complete mapped reads list
  #cut -f1 align.F4.tmp.sam > read.names.45s.long.tmp

  # Extract unmapped reads with a mapped pair
  samtools view align.f4F8.bam | grep -Ff read.names.tmp - > align.f4F8.tmp.sam

  # Re-compile bam file
  cat align.header.tmp align.F4.tmp.sam align.f4F8.tmp.sam | samtools view -bS - > align.hgr1.tmp.bam
    samtools sort align.hgr1.tmp.bam -o align.hgr1.bam

[E::hts_open_format] fail to open file 'align.hgr1.bam'
samtools index: failed to open "align.hgr1.bam": No such file or directory
[E::hts_open_format] fail to open file 'align.hgr1.bam'
samtools flagstat: Cannot open input file "align.hgr1.bam": No such file or directory
mv: cannot stat 'align.hgr1.bam': No such file or directory
mv: cannot stat 'align.hgr1.bam.bai': No such file or directory
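
The two `samtools view` filters in the subset cell select on SAM FLAG bits: `-f` requires bits to be set, `-F` requires them to be clear. A small pure-bash sketch of that logic; `passes_filter` is a hypothetical helper, not part of samtools, and the four flag values are illustrative examples rather than reads from this library:

```bash
#!/bin/bash
# Emulate samtools' -f (require bits) / -F (exclude bits) flag tests.
# Usage: passes_filter FLAG REQUIRE EXCLUDE  -> exit status 0 if the read is kept
passes_filter() {
    local flag=$1 require=$2 exclude=$3
    (( (flag & require) == require && (flag & exclude) == 0 ))
}

# Bit 4 = read unmapped, bit 8 = mate unmapped.
for FLAG in 99 73 69 77; do
    if passes_filter "$FLAG" 0 4; then
        echo "$FLAG: kept by -F 4 (read is mapped)"
    elif passes_filter "$FLAG" 4 8; then
        echo "$FLAG: kept by -f 4 -F 8 (unmapped read, mapped mate)"
    else
        echo "$FLAG: dropped (both mates unmapped)"
    fi
done
```

Flags 99 and 73 land in `align.F4.bam`, flag 69 in `align.f4F8.bam`, and flag 77 is discarded, which is the intended behaviour: keep every mapped read together with its mate, mapped or not.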


In [31]:
# mv align.hgr1 align.hgr1.bam
    samtools index align.hgr1.bam
    samtools flagstat align.hgr1.bam > align.hgr1.flagstat

  # Clean up 
  rm *tmp* align.F4.bam align.f4F8.bam

# Rename the hgr Bam files
  mv align.hgr1.bam align/$LIBRARY.hgr1.bam
  mv align.hgr1.bam.bai align/$LIBRARY.hgr1.bam.bai
  mv align.hgr1.flagstat flagstat/$LIBRARY.hgr1.flagstat

rm: cannot remove '*tmp*': No such file or directory
rm: cannot remove 'align.F4.bam': No such file or directory
rm: cannot remove 'align.f4F8.bam': No such file or directory


## Results

Modification appears normal at 1248U. Most likely the negative Protein Atlas result reflects an antibody that does not work in MCF7, or TSR3 expression below the detection level.

```
chr13:1,004,908
Total count: 5277
A      : 211  (4%,     94+,   117- )
C      : 2210  (42%,     964+,   1246- )
G      : 285  (5%,     144+,   141- )
T      : 2571  (49%,     1193+,   1378- )
N      : 0
---------------
DEL: 231
INS: 0
```
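
The counts above can be reduced to a single number for comparison across libraries. A sketch, assuming T is the reference base at this position and that reads differing from it report the modified base (my reading of the hypothesis, not something the counts themselves establish):

```bash
#!/bin/bash
# Summarise the base counts reported above for chr13:1,004,908.
# Assumption: T is the reference base at this position.
A=211; C=2210; G=285; T=2571
TOTAL=$(( A + C + G + T ))
NONREF=$(( A + C + G ))

NONREF_PCT=$(awk -v n="$NONREF" -v t="$TOTAL" 'BEGIN { printf "%.1f", 100 * n / t }')
echo "Total: $TOTAL"
echo "Non-reference reads: $NONREF (${NONREF_PCT}%)"
```

The four base counts sum to the reported total of 5277, and roughly half of the reads (2706, 51.3%) are non-reference. Deletions (231) are left out of both numerator and denominator here.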

QED