fomightez/simulated_data

simulated_data

A repository for code that generates various simulated data.

Subfolders found here concern:

Related efforts by others

"What is a good and easy way to select 5-10 Mb of neutral DNA sequences in the human genome? Would selecting random intergenic regions (say >10Kb away from genes) be enough? Has someone done something similar recently in a paper I could cite and use the same loci?"
https://twitter.com/vsbuffalo/status/1646212322833334272
"I had to do this recently — I took all exonic + phastcons + UTRS, merged them, and then add 200bp of buffer on both ends (all using bedtools). You could do this and even select out random regions. I did some sensitivity analysis and comparison to the CADD tracks and seemed good."
"Also (and perhaps this is being too paranoid) but I merged the refseq and ensembl tracks. They differ slightly in their percent of basepairs that annotated as coding, so I took the union."

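The merge-then-buffer approach quoted above can be sketched in plain Python for readers who want to see the logic without bedtools. This is only an illustration under stated assumptions: intervals are 0-based, half-open `(chrom, start, end)` tuples, and the function names, the `chrom_sizes` dictionary, and the toy coordinates are hypothetical. The quoted workflow itself used bedtools, and nothing here reproduces its exact settings.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps or touches the running interval: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def pad_intervals(intervals, pad, chrom_sizes):
    """Extend each interval by `pad` bases on both ends, clipped to the chromosome."""
    return [
        (chrom, max(0, start - pad), min(chrom_sizes[chrom], end + pad))
        for chrom, start, end in intervals
    ]


def mask_with_buffer(intervals, pad, chrom_sizes):
    """Merge, pad, then merge again because padding can create new overlaps."""
    padded = pad_intervals(merge_intervals(intervals), pad, chrom_sizes)
    return merge_intervals(padded)


def unmasked_regions(masked, chrom_sizes):
    """Return the complement of merged `masked` intervals on each chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in masked:
        by_chrom[chrom].append((start, end))
    regions = []
    for chrom in sorted(chrom_sizes):
        position = 0
        for start, end in sorted(by_chrom[chrom]):
            if start > position:
                regions.append((chrom, position, start))
            position = max(position, end)
        if position < chrom_sizes[chrom]:
            regions.append((chrom, position, chrom_sizes[chrom]))
    return regions


if __name__ == "__main__":
    # Toy coordinates standing in for exons, conserved elements, and UTRs.
    sizes = {"chr1": 2000}
    features = [("chr1", 300, 400), ("chr1", 350, 500), ("chr1", 1000, 1100)]
    masked = mask_with_buffer(features, pad=200, chrom_sizes=sizes)
    print(unmasked_regions(masked, sizes))
```

Candidate neutral loci could then be drawn at random from the regions that `unmasked_regions` returns, optionally after dropping regions that are too short to be useful.
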
"Go and grab 130G of long-read mock microbial community data from PromethION and 36G from MinION over here, if you fancy: https://github.com/LomanLab/mockcommunity … #UKGS18 - could be useful for bioinformatics pipeline validation and method development!"

https://twitter.com/Hasindu2008/status/1628569325895585793

"Squigulator r10 branch https://github.com/hasindu2008/squigulator/tree/r10 can simulate r10.4.1 signals. Also f5c r10 branch https://github.com/hasindu2008/f5c/tree/r10 can do resquiggle and eventalign for R10.4.1. Note: still work in progress and improvements are on the way. Thanks, @nanopore for providing the pore-model."

"To test different approaches for assembling genomes, I needed data with known microbial content. Only long reads were available, but I needed to test the algorithm on short paired-end reads. This script was written to create short reads from long reads."

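The idea of deriving short paired-end reads from long reads can be sketched as follows. This is not the quoted script. It is a minimal illustration with hypothetical function names and parameters (`read_len`, `fragment_len`, `coverage`), a constant placeholder quality character, and no sequencing-error model: fragments are sampled uniformly from each long read, read 1 is the start of the fragment, and read 2 is the reverse complement of its end.

```python
import random

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_reads_from_long_read(name, seq, read_len=150, fragment_len=400,
                                coverage=1.0, rng=None):
    """Sample paired-end reads from one long read.

    Returns a list of (name1, read1, name2, read2) tuples. Long reads
    shorter than `fragment_len` yield no pairs.
    """
    rng = rng or random.Random()
    if fragment_len < read_len or len(seq) < fragment_len:
        return []
    # Number of pairs needed to reach roughly `coverage` x over this read.
    n_pairs = max(1, int(coverage * len(seq) / (2 * read_len)))
    pairs = []
    for i in range(n_pairs):
        start = rng.randint(0, len(seq) - fragment_len)
        fragment = seq[start:start + fragment_len]
        read1 = fragment[:read_len]
        read2 = reverse_complement(fragment[-read_len:])
        pairs.append((f"{name}_{i}/1", read1, f"{name}_{i}/2", read2))
    return pairs


def fastq_record(name, seq, qual_char="I"):
    """Format one FASTQ record with a constant placeholder quality string."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


if __name__ == "__main__":
    rng = random.Random(0)
    long_read = "".join(rng.choice("ACGT") for _ in range(2000))
    for n1, r1, n2, r2 in paired_reads_from_long_read("read1", long_read, rng=rng)[:2]:
        print(fastq_record(n1, r1), end="")
        print(fastq_record(n2, r2), end="")
```

Because the short reads inherit whatever errors the long reads carry, a sketch like this is best suited to checking that a pipeline runs end to end rather than to benchmarking accuracy.
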
Single cell:

"Our single-cell and spatial omics simulator scDesign3 is now online: https://nature.com/articles/s41587-023-01772-1 scDesign3 has two functionalities: (1) synthetic data simulation and (2) real data interpretation and modification 1/" (The scDesign3 Github repo)

  • Splatter: Simulation Of Single-Cell RNA-seq - Bioconductor (R) package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.

Be sure to also see 'Related software by others I have encountered' on the page Scripts and resources for generating simulated data for gene expression analysis, because I sometimes add items there and not here, and vice versa.

Only Tangentially Related

https://x.com/RobAboukhalil/status/1808602458698232258 July 2024

"When writing bioinformatics tools, I often need small datasets to test edge cases or invalid file formats, e.g. files that are truncated, unsorted, have extraneous whitespace, etc. I started compiling examples here: https://github.com/omgenomics/bio-data-zoo, contributions are welcome!" Bio Data Zoo
>"This repo contains example data in various genomics file formats. It is intended for bioinformatics tool developers to make testing software easier. It includes examples of valid file formats, edge cases, and invalid formats."