GitHub - cdeboever3/CSE280a_winter2012: Project for CSE280a, winter 2012

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
README		README
alpha_vs_error.sh		alpha_vs_error.sh
coverage_vs_alpha.sh		coverage_vs_alpha.sh
enumerate_parameters.py		enumerate_parameters.py
enumerate_parameters_fork.py		enumerate_parameters_fork.py
generate_genomes.py		generate_genomes.py
generate_sample_tsv.py		generate_sample_tsv.py
generate_sample_tsv_complex.py		generate_sample_tsv_complex.py
generate_sample_tsv_complex_fork.py		generate_sample_tsv_complex_fork.py
graphs.R		graphs.R
graphs.py		graphs.py
make_alpha_vs_error.py		make_alpha_vs_error.py
make_coverage_vs_alpha.py		make_coverage_vs_alpha.py
make_reads.py		make_reads.py
make_variant_calling_input.py		make_variant_calling_input.py
mixed_variant_calling.py		mixed_variant_calling.py
mixed_variant_calling_fork.py		mixed_variant_calling_fork.py
mixed_variant_calling_fork_fork.py		mixed_variant_calling_fork_fork.py
pileup_to_tsv.py		pileup_to_tsv.py
reads.1.fq		reads.1.fq
reads.2.fq		reads.2.fq
sample_genome.py		sample_genome.py
simulate_reads.py		simulate_reads.py
test.tsv		test.tsv
test_genotypes.tsv		test_genotypes.tsv
wgsim		wgsim
writeup.tex		writeup.tex

Repository files navigation

Code for our CSE280a project!

wgsim is a read simulator from https://github.com/lh3/wgsim

Project description from Vineet:

5 SNV calling in mixed samples
Consider a project in which you are calling variants in a donor (relative to a reference) that may have some disease. Unfortunately (as with tumor samples), the donor sample might be a mix of normal (fraction α of the cells) and diseased (1 − α) tissue, where the value of α is unknown. Most SNV calling pipelines assume a pure sample and call variants assuming that each variant is homozygous (all reads have the same allele) or heterozygous (the two alleles appear in a 50:50 ratio), and are confused by such mixed samples.
5.1 Input
All input is simulated, and generating it is part of the project.
5.2 Output
1. A program that simulates reads from a fictitious donor with a mix of two genomes (normal and diseased). First create two genomes (Use 100Mbp of the real reference from one of the chromosomes) with variations in one of the two. The variations are somatic, and are very sparse. Then, use the bwa read simulator to simulate reads from the two genomes sampled at rate α, 1 − α. The number of reads is controlled by a parameter c, denoting average coverage of all base-pairs.
2. Devise an algorithm for identifying variants. Test performance by looking at false-positives, false- negatives in multiple samples for different values of coverage, α, and read error-rate.
3. Your algorithm should also provide an estimate of α.

## Prerequisites ##
Downloaded hg19 from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
Extract into directory 'hg19'

## Example commands ##

# Output random 10Mb region from hg19 to file ref.fa
sample_genome.py --genome-dir hg19 --output ref.fa --sample-length 1e7

# Generate an individual's normal and cancer diploid genomes using
# ref.fa as the reference. Output genomes into n1.fa (normal genome)
# and c1.fa (cancer genome). Output mutation log files into n1.var and c1.var.
generate_genomes.py --ref ref.fa --k n1 c1 --pl 2 --germ-mu 1e-3 --som-mu 1e-3 --output-dir .

# Simulate 1e4 reads from the 10Mb region ref.fa. Reads come from an
# individual's normal and cancer diploid genomes at proportions of
# 0.25 and 0.75, respectively. Germline mutation rate is 1e-3 and
# somatic mut rate is 1e-5. Parallelize with two processes.
simulate_reads.py --genomes n1.fa c1.fa --alpha 0.25 0.75 --reads 1e4 --reads1 reads.1.fq --reads2 reads.2.fq --p 2

# Generate a sample tsv file for input into mixed_variant_calling.py. This is a naive simulation mostly meant for debugging:
python generate_sample_tsv.py > test.tsv

# Run mixed_variant_calling.py on a properly formatted tsv file
python mixed_variant_calling.py test.tsv > test_genotypes.tsv