cdeboever3/CSE280a_winter2012
Code for our CSE280a project!

wgsim is a read simulator from https://github.com/lh3/wgsim

Project description from Vineet:

5 SNV calling in mixed samples

Consider a project in which you are calling variants in a donor (relative to a reference) that may have some disease. Unfortunately (as with tumor samples), the donor sample might be a mix of normal (fraction α of the cells) and diseased (fraction 1 − α) tissue, where the value of α is unknown. Most SNV calling pipelines assume a pure sample and call variants assuming that each variant is homozygous (all reads have the same allele) or heterozygous (the two alleles appear in a 50:50 ratio), and are confused by such mixed samples.

5.1 Input

All input is simulated, and generating it is part of the project.

5.2 Output

1. A program that simulates reads from a fictitious donor with a mix of two genomes (normal and diseased). First create two genomes (use 100Mbp of the real reference from one of the chromosomes) with variations in one of the two. The variations are somatic, and are very sparse. Then, use the bwa read simulator to simulate reads from the two genomes sampled at rates α and 1 − α. The number of reads is controlled by a parameter c, denoting average coverage of all base-pairs.
2. Devise an algorithm for identifying variants. Test performance by looking at false positives and false negatives in multiple samples for different values of coverage, α, and read error-rate.
3. Your algorithm should also provide an estimate of α.

## Prerequisites ##

Download hg19 from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz and extract it into the directory 'hg19'.

## Example commands ##

    # Output random 10Mb region from hg19 to file ref.fa
    sample_genome.py --genome-dir hg19 --output ref.fa --sample-length 1e7

    # Generate an individual's normal and cancer diploid genomes using
    # ref.fa as the reference. Output genomes into n1.fa (normal genome)
    # and c1.fa (cancer genome). Output mutation log files into n1.var and c1.var.
    generate_genomes.py --ref ref.fa --k n1 c1 --pl 2 --germ-mu 1e-3 --som-mu 1e-3 --output-dir .

    # Simulate 1e4 reads from the 10Mb region ref.fa. Reads come from an
    # individual's normal and cancer diploid genomes at proportions of
    # 0.25 and 0.75, respectively. Germline mutation rate is 1e-3 and
    # somatic mutation rate is 1e-5. Parallelize with two processes.
    simulate_reads.py --genomes n1.fa c1.fa --alpha 0.25 0.75 --reads 1e4 --reads1 reads.1.fq --reads2 reads.2.fq --p 2

    # Generate a sample tsv file for input into mixed_variant_calling.py.
    # This is a naive simulation mostly meant for debugging:
    python generate_sample_tsv.py > test.tsv

    # Run mixed_variant_calling.py on a properly formatted tsv file
    python mixed_variant_calling.py test.tsv > test_genotypes.tsv
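
The mixture described in the project description implies a simple relationship between α and the fraction of reads that carry a somatic variant. The sketch below illustrates that relationship and a naive estimate of α from read counts. It is not the method used by mixed_variant_calling.py: the function names are hypothetical, and it assumes that every somatic variant is heterozygous in the diseased genome, that both genomes are diploid with no copy-number changes, and that sequencing errors are ignored.

```python
def expected_somatic_vaf(alpha):
    """Expected fraction of reads carrying a heterozygous somatic variant.

    alpha is the fraction of normal cells; the diseased cells (1 - alpha)
    carry the variant on one of their two chromosome copies (assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) / 2.0


def estimate_alpha(counts):
    """Naive estimate of alpha from (alt_reads, depth) pairs.

    counts should hold only sites believed to be heterozygous somatic
    variants. Reads are pooled across sites, and the relation
    vaf = (1 - alpha) / 2 is inverted to give alpha = 1 - 2 * vaf.
    """
    counts = list(counts)
    alt_total = sum(alt for alt, _ in counts)
    depth_total = sum(depth for _, depth in counts)
    if depth_total == 0:
        raise ValueError("no reads cover the given sites")
    vaf = alt_total / depth_total
    # Clamp to [0, 1] because sampling noise can push the raw value outside.
    return min(1.0, max(0.0, 1.0 - 2.0 * vaf))


if __name__ == "__main__":
    # With alpha = 0.25 (as in the simulate_reads.py example), a somatic
    # heterozygous variant is expected on (1 - 0.25) / 2 of the reads.
    print(expected_somatic_vaf(0.25))
    # Two hypothetical sites given as (alt reads, total depth).
    print(estimate_alpha([(3, 8), (6, 16)]))
```

Under these assumptions, germline heterozygous variants still appear at a 50:50 ratio because they are present in both the normal and the diseased genome, so only somatic sites carry information about α.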
About
Project for CSE280a, winter 2012