# Simulate a transcriptome with TE transcripts using polyester

1. Generate the transcriptome, index it, and simulate reads ([see code](./make_l1hs_chr22_txome.py))
   - Spliced and unspliced transcripts from the GENCODE annotation
   - L1 transcripts from full-length L1HS annotations in the reference genome

2. Quantify reads with salmon
   - Build a salmon index of the transcriptome (use the same transcriptome as in step 1)
   - Run salmon quantification on the simulated reads

3. Compare with original count matrix
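
The comparison in step 3 can be sketched with plain pandas. This is a minimal illustration, not the logic inside `BenchmarkSalmon`: the function name `compare_counts`, the toy transcript IDs, and the assumption that both matrices are transcripts-by-samples data frames are all hypothetical.

```python
import pandas as pd


def compare_counts(true_counts: pd.DataFrame, est_counts: pd.DataFrame) -> pd.DataFrame:
    """Return a long table of true vs. estimated counts per transcript and sample.

    Both inputs are assumed (hypothetically) to be indexed by transcript ID
    with one column per sample.
    """
    # Only compare transcripts and samples present in both matrices.
    shared_tx = true_counts.index.intersection(est_counts.index)
    shared_samples = true_counts.columns.intersection(est_counts.columns)

    rows = []
    for sample in shared_samples:
        rows.append(
            pd.DataFrame(
                {
                    "transcript": shared_tx,
                    "sample": sample,
                    "true": true_counts.loc[shared_tx, sample].to_numpy(),
                    "estimated": est_counts.loc[shared_tx, sample].to_numpy(),
                }
            )
        )
    long = pd.concat(rows, ignore_index=True)
    long["difference"] = long["estimated"] - long["true"]
    return long


# Toy data for illustration only; these are not results from this notebook.
tx_ids = ["tx1", "tx2", "L1HS_1"]
true_counts = pd.DataFrame(
    {"sample_01": [100, 50, 0], "sample_02": [80, 60, 10]}, index=tx_ids
)
est_counts = pd.DataFrame(
    {"sample_01": [98.0, 55.0, 2.0], "sample_02": [80.0, 57.0, 8.0]}, index=tx_ids
)

comparison = compare_counts(true_counts, est_counts)

# Mean absolute difference per sample as a simple summary.
comparison["abs_difference"] = comparison["difference"].abs()
mae = comparison.groupby("sample")["abs_difference"].mean()
```

The long format is convenient for seaborn, for example a scatter plot of `true` against `estimated` colored by whether the transcript is an L1.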

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from src.bench import BenchmarkSalmon

In [3]:
bm = BenchmarkSalmon(
    reads_dir="../resources/chr22_l1hs_txome/sim_reads_1",
    txome_dir="../resources/chr22_l1hs_txome",
    outdir="../resources/chr22_l1hs_txome/sim_reads_1_salmon",
)
# bm.index()
bm.quant()
benchmark = bm.read_counts()

[INFO:src.bench: 11-01 21:28:14] - Running Salmon quant for sample_01
[INFO:src.bench: 11-01 21:29:08] - Running Salmon quant for sample_02
[INFO:src.bench: 11-01 21:30:05] - Running Salmon quant for sample_03
[INFO:src.bench: 11-01 21:31:03] - Running Salmon quant for sample_04
[INFO:src.bench: 11-01 21:31:59] - Running Salmon quant for sample_05


In [4]:
bm.plot_difference()
bm.plot_l1hs()

ValueError: Must run read_counts() first!

In [None]:
bm = BenchmarkSalmon(
    reads_dir="../resources/chr22_l1hs_txome/sim_reads_2",
    txome_dir="../resources/chr22_l1hs_txome",
    outdir="../resources/chr22_l1hs_txome/sim_reads_2_salmon",
)
# bm.index()
bm.quant()
benchmark = bm.read_counts()

bm.plot_difference()
bm.plot_l1hs()

[INFO:src.bench: 11-01 21:22:03] - Running Salmon quant for sample_01
[INFO:src.bench: 11-01 21:23:02] - Running Salmon quant for sample_02


In [None]:
bm = BenchmarkSalmon(
    reads_dir="../resources/chr22_l1hs_txome/sim_reads_3",
    txome_dir="../resources/chr22_l1hs_txome",
    outdir="../resources/chr22_l1hs_txome/sim_reads_3_salmon",
)
# bm.index()
bm.quant()
benchmark = bm.read_counts()

bm.plot_difference()
bm.plot_l1hs()

In [None]:
bm = BenchmarkSalmon(
    reads_dir="../resources/chr22_l1hs_txome/sim_reads_4",
    txome_dir="../resources/chr22_l1hs_txome",
    outdir="../resources/chr22_l1hs_txome/sim_reads_4_salmon",
)
# bm.index()
bm.quant()
benchmark = bm.read_counts()

bm.plot_difference()
bm.plot_l1hs()

In [None]:
bm = BenchmarkSalmon(
    reads_dir="../resources/chr22_l1hs_txome/sim_reads_5",
    txome_dir="../resources/chr22_l1hs_txome",
    outdir="../resources/chr22_l1hs_txome/sim_reads_5_salmon",
)

bm.index()
bm.quant()
benchmark = bm.read_counts()

bm.plot_difference()
bm.plot_l1hs()

## Next steps:

1. How does this compare to TEtranscripts, SQuIRE, and L1EM?
	- Add each program to the conda environment
	- Write a subclass in `src/bench.py` to run each program
2. Simulate reads with non-uniform read-count distributions for L1 transcripts only: choose a different read count for each L1 in each sample. Add this to the `make_l1hs_chr22_txome.py` script.
3. Simulate reads with non-uniform read-count distributions for non-L1 transcripts while holding L1 counts constant. Add this to the `make_l1hs_chr22_txome.py` script.

	To generate a more realistic simulation, see the Polyester simulations described in the Methods of the Salmon paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/

4. Try different salmon indexing (`-k`) and quantification parameters (e.g. `--seqBias`, `--gcBias`, `--posBias`)
5. Inspect intermediate results of salmon (where are L1 reads mapping?)
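
For step 2 above, one possible way to draw a different read count for each L1 in each sample is sketched below. Everything here is an assumption for illustration: the function name `simulate_count_matrix`, the negative binomial distribution, and the default values for `base_count`, `l1_mean`, and `l1_dispersion` are not taken from `make_l1hs_chr22_txome.py`. The sketch only builds the count matrix; passing it to the read simulator is left out.

```python
import numpy as np
import pandas as pd


def simulate_count_matrix(
    tx_ids,
    is_l1,
    n_samples,
    base_count=20,       # hypothetical constant count for non-L1 transcripts
    l1_mean=50.0,        # hypothetical mean count for L1 transcripts
    l1_dispersion=0.5,   # hypothetical dispersion: var = mean + dispersion * mean**2
    seed=0,
):
    """Build a transcripts-by-samples count matrix with variable L1 counts."""
    rng = np.random.default_rng(seed)
    is_l1 = np.asarray(is_l1, dtype=bool)

    # Non-L1 transcripts get the same count in every sample.
    counts = np.full((len(tx_ids), n_samples), base_count, dtype=np.int64)

    # Convert (mean, dispersion) to numpy's (n, p) parameterization.
    n = 1.0 / l1_dispersion
    p = n / (n + l1_mean)

    # Each L1 gets an independent draw in each sample.
    counts[is_l1, :] = rng.negative_binomial(n, p, size=(is_l1.sum(), n_samples))

    samples = [f"sample_{i:02d}" for i in range(1, n_samples + 1)]
    return pd.DataFrame(counts, index=list(tx_ids), columns=samples)


# Toy transcript IDs for illustration only.
tx_ids = ["tx1", "tx2", "L1HS_1", "L1HS_2"]
is_l1 = [False, False, True, True]
count_matrix = simulate_count_matrix(tx_ids, is_l1, n_samples=5)
```

Step 3 would be the mirror image: draw the non-L1 rows from the distribution and fill the L1 rows with a constant.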