# MCB112 pset02: the adventure of the ten Arcs
* Eric Yang
* 09/21/2020

In [168]:
import numpy as np

## kallisto

In [169]:
! kallisto

kallisto 0.46.2

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    bus           Generate BUS files for single-cell data 
    pseudo        Runs the pseudoalignment step 
    merge         Merges several batch runs 
    h5dump        Converts HDF5-formatted results to plaintext
    inspect       Inspects and gives information about an index
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>



## reproduce Moriarty's result

In [170]:
# Explore fasta file
! gunzip -c arc.fasta.gz | head -10

>Arc1
TAGCCTTCATCCTGTGTGGGTGTGGGCTCCCACTCGGTTCTAGGTCAGTACGAGCCTGCA
CCTTCCTGTGGAGCAAGTCCGTCTCCTTCCTGCGCTCATACCTAATGAGTGAGGCGCTAA
CTGCCCCTATGGGCGGATGGACCCAACTAGCCCATGAGTCGACCACCAGAGAACCTTGAT
CCGTCCTTGCCAGCATTAATGAGCATTCTCTTAGTTTTGACAGCGGGGCGATTCATGAGA
AACATATGCTTCCCCTTGTTCGAGCCGGATCACTTGAGTCGATACGTCTCCGGGGGTCTC
CGGGGAAGCCTCAGGGACCTAGTCCGATAACAGACACCTATATGCTAGTTGCTGGTGGAT
TGTGTTTCAATCTTCTTCCAAGAAGTGCACGTAAACATGGGGGTGTCGGTTATGGAAAGG
ATACCTATCTCCAGAATCAGTAACAAGTCAATGTAACGGGACGCACGGGACTCACCATCT
CTAGTATGCACTCTGCCGATGGGAACTTCGAATGCGCGATGCCTCTATTTCCAGTTGTAG


In [185]:
# Explore fastq file
! gunzip -c arc.fastq.gz | head -10

@read0
GTATCCGTGAATAACCCACCTAATGCATGGGCGTTCAAATGGTGGTTATGCTAAAAAAGACGTGGGAATTTTGCA
+
???????????????????????????????????????????????????????????????????????????
@read1
AAAGATACCTACGAGCTCGAACTAGCACTATGACAAACATGCTGCGCGTCCACTTCCCACCGTAACGCCGAAGTG
+
???????????????????????????????????????????????????????????????????????????
@read2
GACCCCCGGAGACGTATCGACTCAAGTGATCCGGCTCGAACAAGGGGAAGCATATGTTTCTCATGAATCGCCCCG
gunzip: error writing to output: Broken pipe
gunzip: arc.fastq.gz: uncompress failed


In [172]:
# Build kallisto index of transcriptome
! kallisto index -i transcripts.idx arc.fasta.gz


[build] loading fasta file arc.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 



In [173]:
# Map reads against transcriptome
! kallisto quant -i transcripts.idx -o output --single -l 150 -s 20 arc.fastq.gz  


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: arc.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,981 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



In [174]:
# Look at abundance.tsv output, check match with moriarty data
with open("output/abundance.tsv",'r') as infile:   
    for line in infile:
        print(line)

target_id	length	eff_length	est_counts	tpm

Arc1	4000	3851	3348.79	24514.5

Arc2	2000	1851	3574.56	54440.7

Arc3	3000	2851	28065.2	277510

Arc4	4000	3851	10597.7	77579.1

Arc5	4000	3851	12526.7	91700.1

Arc6	3000	2851	1953.39	19315.2

Arc7	2000	1851	5579.78	84980.4

Arc8	2000	1851	5700.81	86823.7

Arc9	3000	2851	3052.55	30183.8

Arc10	3000	2851	25581.6	252953



These TPMs are of the same order of magnitude as Moriarty's results and essentially match them.
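As a sanity check on the table above, the `tpm` column can be reproduced from `est_counts` and `eff_length`: each transcript's rate is `est_counts / eff_length`, and TPM rescales those rates to sum to one million. The sketch below hard-codes the values printed above; the function name `tpm_from_counts` is my own and is not part of kallisto.

```python
import numpy as np

def tpm_from_counts(est_counts, eff_length):
    """Rescale per-transcript read rates (counts per effective base) to sum to 1e6."""
    rate = np.asarray(est_counts, dtype=float) / np.asarray(eff_length, dtype=float)
    return 1e6 * rate / rate.sum()

# Values copied from output/abundance.tsv above (Arc1..Arc10)
est_counts = [3348.79, 3574.56, 28065.2, 10597.7, 12526.7,
              1953.39, 5579.78, 5700.81, 3052.55, 25581.6]
eff_length = [3851, 1851, 2851, 3851, 3851, 2851, 1851, 1851, 2851, 2851]

tpm = tpm_from_counts(est_counts, eff_length)
for number, value in zip(range(1, 11), tpm):
    print(f"Arc{number}\t{value:.1f}")
```

The effective lengths in this run are consistent with `length - 150 + 1`, that is, the transcript length minus the mean fragment length passed via `-l`, plus one.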

## simulate an Arc transcriptome and RNA-seq reads

In [175]:
# Set up the Arc locus 
S         = 10           # Number of segments in the Arc locus (A..J)
T         = S            # Number of different transcripts (the same, one starting on each segment, 1..10)
N         = 100000       # total number of observed reads we generate
len_S     = 1000         # length of each segment (nucleotides)
len_Arc   = len_S * S    # total length of the Arc locus (nucleotides)
len_R     = 75           # read length

In [176]:
# Generate 10kb Arc locus DNA sequence
np.random.seed(5)
arc_seq = ''.join(np.random.choice(list('ACGT'), len_Arc))

In [177]:
# Transcript lengths (L) and nucleotide abundances (V) for Arc1..Arc10
L = [4 * len_S, 2 * len_S, 3 * len_S, 4 * len_S, 4 * len_S,
     3 * len_S, 2 * len_S, 2 * len_S, 3 * len_S, 3 * len_S,]
V = [0.0081, 0.0391, 0.2911, 0.1121, 0.1271, 0.0081, 0.0591, 0.0601, 0.0221, 0.2731] 
# Added 0.0001 to each nucleotide abundance so that V sums to 1

In [178]:
# Generate Arc1 to Arc10 transcripts
arc_trans = {}
with open('arc_sim.fasta','w') as outfile:
    for arc in range(S):
        name = '>Arc' + str(arc + 1)
        start = arc * len_S
        end = start + L[arc]
        if end > len_Arc:
            seq = arc_seq[start:len_Arc] + arc_seq[0:end%len_Arc]
        else: 
            seq = arc_seq[start:end]
        arc_trans[name] = seq
        outfile.write(name + '\n')
        outfile.write(seq + '\n')

In [179]:
# Generate reads
with open('arc_sim.fastq','w') as outfile:
    for read in range(N):
        # sample transcript
        i = np.random.choice(range(0,T), p=V) # index 0-9, drawn in proportion to nucleotide abundance
        # pick random start position (randint's upper bound is exclusive,
        # so the last valid start, L[i] - len_R, is never drawn)
        start = np.random.randint(0, L[i] - len_R)
        seq = arc_trans['>Arc' + str(i + 1)][start:start + len_R] 
        # above: add one to convert Python's 0-9 indexing to Arc1-Arc10 names
        outfile.write('@read' + str(read) + '\n')
        outfile.write(seq + '\n')
        outfile.write('+' + '\n')
        outfile.write('I' * len_R + '\n')

## test kallisto

In [180]:
# Build kallisto index of transcriptome for sim
! kallisto index -i transcripts_sim.idx arc_sim.fasta


[build] loading fasta file arc_sim.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 



In [181]:
# Map reads against transcriptome for sim
! kallisto quant -i transcripts_sim.idx -o output_sim --single -l 75 -s 10 arc_sim.fastq 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: arc_sim.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 64 rounds



In [182]:
# Look at abundance.tsv output
with open("output_sim/abundance.tsv",'r') as infile:   
    for line in infile:
        print(line)

target_id	length	eff_length	est_counts	tpm

Arc1	4000	3926	2688.88	19817.9

Arc2	2000	1926	3403.34	51131.1

Arc3	3000	2926	28558	282416

Arc4	4000	3926	10732	79098.1

Arc5	4000	3926	12332.1	90891.5

Arc6	3000	2926	1867.71	18470.2

Arc7	2000	1926	6024.08	90504.5

Arc8	2000	1926	5364.22	80590.9

Arc9	3000	2926	3070.9	30368.8

Arc10	3000	2926	25958.7	256711



For Arcs 1, 2, 3, 5, 6, 8, 9, and 10, kallisto's TPM estimates on my simulated reads are closer to Moriarty's result (first table above) than to the "true" abundances I used to generate the reads. Kallisto's inferred TPMs are therefore off here.
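For reference, the "true" TPMs follow from the simulation inputs: dividing each nucleotide abundance in `V` by its transcript length in `L` and renormalizing gives the transcript abundance, and multiplying by one million gives TPM. This is a minimal sketch that assumes nominal transcript lengths rather than kallisto's effective lengths; the names `tau` and `true_tpm` are mine.

```python
import numpy as np

# Transcript lengths and nucleotide abundances from the simulation (Arc1..Arc10)
L = np.array([4000, 2000, 3000, 4000, 4000, 3000, 2000, 2000, 3000, 3000], dtype=float)
V = np.array([0.0081, 0.0391, 0.2911, 0.1121, 0.1271,
              0.0081, 0.0591, 0.0601, 0.0221, 0.2731])

# Transcript abundance: divide out length, then renormalize to sum to 1
tau = (V / L) / np.sum(V / L)
true_tpm = 1e6 * tau

for number, value in zip(range(1, 11), true_tpm):
    print(f"Arc{number}\t{value:.0f}")
```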

## debug kallisto

Arc is unusual in that it is circular and its transcripts overlap; could this be the issue? To test this, I remove the overlap so that Arc is treated like a linear sequence with no overlapping transcripts. In this no-overlap condition, each read should map to exactly one transcript, since the transcripts share no sequence and a 75-nt read from a random sequence should be long enough to be unique.
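To make the extent of the overlap concrete, the sketch below counts how many of the full-length transcripts include each of the ten segments of the circular locus. The segment counts per transcript are taken from `L` in the simulation; the variable names are my own.

```python
S = 10  # segments in the circular Arc locus
seg_len = [4, 2, 3, 4, 4, 3, 2, 2, 3, 3]  # transcript lengths in segments (Arc1..Arc10)

# covering[s] lists the transcripts whose sequence includes segment s
covering = [[] for _ in range(S)]
for t, n_seg in enumerate(seg_len):
    for offset in range(n_seg):
        # transcript t starts on segment t and wraps around the circle
        covering[(t + offset) % S].append(f"Arc{t + 1}")

counts = [len(names) for names in covering]
for s, names in enumerate(covering):
    print(f"segment {s}: {len(names)} transcripts -> {', '.join(names)}")
```

Every segment is shared by at least two transcripts, so a read that falls entirely within one segment is always compatible with more than one transcript in the original design.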

In [184]:
# Keep everything fixed except shorten each transcript so that none overlap, then re-run the simulation and kallisto
# Relative transcript lengths are the same as before
L = [.8 * len_S, .4 * len_S, .6 * len_S, .8 * len_S, .8 * len_S,
     .6 * len_S, .4 * len_S, .4 * len_S, .6 * len_S, .6 * len_S,]
L = [int(i) for i in L]

# Generate new shortened Arc1 to Arc10 transcripts
arc_trans = {}
with open('arc_sim_2.fasta','w') as outfile:
    for arc in range(S):
        name = '>Arc' + str(arc + 1)
        start = arc * len_S
        end = start + L[arc]
        seq = arc_seq[start:end]
        arc_trans[name] = seq
        outfile.write(name + '\n')
        outfile.write(seq + '\n')
        
# Generate reads
with open('arc_sim_2.fastq','w') as outfile:
    for read in range(N):
        # sample transcript
        i = np.random.choice(range(0,T), p=V) # index 0-9, drawn in proportion to nucleotide abundance
        # pick random start position (randint's upper bound is exclusive,
        # so the last valid start, L[i] - len_R, is never drawn)
        start = np.random.randint(0, L[i] - len_R)
        seq = arc_trans['>Arc' + str(i + 1)][start:start + len_R] 
        # above: add one to convert Python's 0-9 indexing to Arc1-Arc10 names
        outfile.write('@read' + str(read) + '\n')
        outfile.write(seq + '\n')
        outfile.write('+' + '\n')
        outfile.write('I' * len_R + '\n')
        
# Build kallisto index of transcriptome for sim
! kallisto index -i transcripts_sim_2.idx arc_sim_2.fasta

# Map reads against transcriptome for sim
! kallisto quant -i transcripts_sim_2.idx -o output_sim_2 --single -l 75 -s 10 arc_sim_2.fastq 

# Look at abundance.tsv output
with open("output_sim_2/abundance.tsv",'r') as infile:   
    for line in infile:
        print(line)


[build] loading fasta file arc_sim_2.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 10 contigs and contains 5700 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 5,700
[index] number of equivalence classes: 10
[quant] running in single-end mode
[quant] will process file 1: arc_sim_2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

target_id	length	eff_length	est_counts	tpm

Arc1	800	726	841	5919.74

Arc2	400	326	3907	61244.8

Arc3	600	526	29271	284377

Arc4	800	726	11016	77540.8

Arc5	800	726	12771	89894.1

Arc6	600	526	772	7500.23



With no overlapping Arc transcripts, the results are now much closer to the true TPMs than to Moriarty's. (I did not rigorously compare the differences with significance testing, but visual inspection of the values shows this run is much closer to the "truth" than before.) Kallisto seems to work fine on transcripts without overlaps, where it has no trouble mapping each read to a unique transcript.

For a locus with many overlapping transcripts like Arc, kallisto seems to have trouble with the "non-unique" problem discussed in class, where a read can potentially map to multiple transcripts. Kallisto's expectation-maximization algorithm still produces a reasonable result, with most inferred TPMs within the same order of magnitude as the true TPMs. However, ambiguous read assignment adds a source of error on top of the existing sampling noise, so the inferred TPMs deviate farther from the truth, most visibly for the low-abundance transcripts Arc1, Arc6, and Arc9.
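To illustrate how expectation maximization handles reads that are compatible with more than one transcript, here is a toy sketch. It is not kallisto's implementation (kallisto works on equivalence classes of pseudoaligned reads); the function `toy_em`, the compatibility matrix, and the two-transcript example are all hypothetical.

```python
import numpy as np

def toy_em(compat, eff_len, n_iter=100):
    """Estimate the fraction of reads that come from each transcript.

    compat  : (n_reads, n_transcripts) 0/1 array; 1 means the read is
              compatible with the transcript. Every read needs at least one 1.
    eff_len : effective length of each transcript.
    """
    compat = np.asarray(compat, dtype=float)
    eff_len = np.asarray(eff_len, dtype=float)
    n_reads, n_tx = compat.shape
    rho = np.full(n_tx, 1.0 / n_tx)  # start from uniform read fractions
    for _ in range(n_iter):
        # E-step: split each read among its compatible transcripts,
        # weighted by current abundance per effective base
        weights = compat * (rho / eff_len)
        weights /= weights.sum(axis=1, keepdims=True)
        # M-step: new read fraction is the expected share of reads
        rho = weights.sum(axis=0) / n_reads
    return rho

# Hypothetical example: 60 reads unique to A, 20 unique to B, 20 shared
compat = np.array([[1, 0]] * 60 + [[0, 1]] * 20 + [[1, 1]] * 20)
rho = toy_em(compat, eff_len=[1000, 1000])
print(rho)
```

In this toy case the shared reads end up split in the same 3:1 ratio as the unique reads. When few or no reads are unique to a transcript, as for much of the overlapping Arc locus, the estimate rests on less direct evidence.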

<table>
  <thead>
    <tr>
      <th>Transcript</th>
      <th>"True" TPM</th>
      <th>TPM w/o overlap</th>
      <th>TPM w/ overlap</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Arc1</td>
      <td>6000</td>      
      <td>5920</td>
      <td>19818</td>
    </tr>
    <tr>
      <td>Arc2</td>
      <td>58000</td>      
      <td>61245</td>
      <td>51131</td>
    </tr>
      <tr>
      <td>Arc3</td>
      <td>290000</td>      
      <td>284377</td>
      <td>282416</td>
    </tr>
      <tr>
      <td>Arc4</td>
      <td>83000</td>      
      <td>77541</td>
      <td>79098</td>
    </tr>
      <tr>
      <td>Arc5</td>
      <td>94000</td>      
      <td>89894</td>
      <td>90892</td>
    </tr>
      <tr>
      <td>Arc6</td>
      <td>7800</td>      
      <td>7500</td>
      <td>18470</td>
    </tr>
      <tr>
      <td>Arc7</td>
      <td>87000</td>      
      <td>93584</td>
      <td>90504</td>
    </tr>
      <tr>
      <td>Arc8</td>
      <td>88000</td>      
      <td>93396</td>
      <td>80591</td>
    </tr>
      <tr>
      <td>Arc9</td>
      <td>22000</td>      
      <td>21393</td>
      <td>30369</td>
    </tr>
      <tr>
      <td>Arc10</td>
      <td>270000</td>      
      <td>265151</td>
      <td>256711</td>
    </tr>
  </tbody>
</table>
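As a rough quantitative complement to the visual comparison, the sketch below computes the mean relative error of each kallisto run against the "true" TPM column. The values are copied from the table above; the helper `mean_rel_error` is my own, and this is a simple summary, not a significance test.

```python
import numpy as np

# Values copied from the table above (Arc1..Arc10)
true_tpm   = np.array([6000, 58000, 290000, 83000, 94000, 7800, 87000, 88000, 22000, 270000])
no_overlap = np.array([5920, 61245, 284377, 77541, 89894, 7500, 93584, 93396, 21393, 265151])
overlap    = np.array([19818, 51131, 282416, 79098, 90892, 18470, 90504, 80591, 30369, 256711])

def mean_rel_error(estimate, truth):
    """Mean of |estimate - truth| / truth across transcripts."""
    return float(np.mean(np.abs(estimate - truth) / truth))

err_no_overlap = mean_rel_error(no_overlap, true_tpm)
err_overlap = mean_rel_error(overlap, true_tpm)
print(f"without overlap: {err_no_overlap:.3f}")
print(f"with overlap:    {err_overlap:.3f}")
```

On these numbers the mean relative error is roughly 4% without overlap and roughly 45% with overlap, with most of the latter coming from Arc1 and Arc6. The "true" column is rounded to two significant figures, so small differences should not be over-interpreted.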