How to filter genotype RDD with FeatureRDD #890
How can I filter a genotypeRDD with a FeatureRDD? I get the following error:
with this code:
Do I need to convert the genotypeRDD and FeatureRDD to a ReferenceRegionRDD? Is this an implicit conversion done automatically by importing a certain class?
You'll need to key each RDD with a ReferenceRegion, e.g.:
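As a rough illustration of the keying idea (this is not the snippet from the reply, and it deliberately avoids Spark and ADAM), the sketch below keys plain Scala collections by a genomic interval and keeps only the genotypes that overlap some feature. `Region`, `Genotype`, `Feature`, and `RegionFilterSketch` are hypothetical stand-ins, not ADAM classes, and intervals are assumed to be half-open.

```scala
// Hypothetical stand-in for a genomic interval; NOT ADAM's ReferenceRegion class.
final case class Region(contig: String, start: Long, end: Long) {
  // Half-open intervals [start, end) overlap when each one starts
  // before the other ends and both lie on the same contig.
  def overlaps(other: Region): Boolean =
    contig == other.contig && start < other.end && other.start < end
}

// Hypothetical minimal records standing in for genotypes and features.
final case class Genotype(sampleId: String, contig: String, start: Long, end: Long)
final case class Feature(name: String, contig: String, start: Long, end: Long)

object RegionFilterSketch {
  // Key every genotype by the region it covers.
  def keyGenotypes(gs: Seq[Genotype]): Seq[(Region, Genotype)] =
    gs.map(g => (Region(g.contig, g.start, g.end), g))

  // Key every feature by the region it covers.
  def keyFeatures(fs: Seq[Feature]): Seq[(Region, Feature)] =
    fs.map(f => (Region(f.contig, f.start, f.end), f))

  // Keep the genotypes whose region overlaps at least one feature region.
  // This is a naive all-pairs comparison, fine for a sketch but not for real data sizes.
  def filterByFeatures(gs: Seq[Genotype], fs: Seq[Feature]): Seq[Genotype] = {
    val featureRegions = keyFeatures(fs).map(_._1)
    keyGenotypes(gs).collect {
      case (region, g) if featureRegions.exists(_.overlaps(region)) => g
    }
  }
}
```

Once both sides carry a region key, a region join only needs to compare keys, which is why both datasets have to be keyed the same way before joining.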
Hi @fnothaft. Thank you for the information. I needed to move some parentheses, but the command now starts to execute.
Tasks of the operation keep failing, though, and keep being resubmitted:
Is this by any chance a very resource-intensive operation? I am running this on 11 m3.xlarge machines. Or is there another likely cause for the tasks that keep failing and being resubmitted? In the end I just aborted the process.
The annotation file that I am using is
(I removed the chr prefix from the chromosome names to match the chromosome names in the 1000 Genomes VCF file; the reference version should match.)
It shouldn't be terribly expensive, however you may want to try running the ShuffleRegionJoin instead of the BroadcastRegionJoin. I'll need to look over the GTF loading code, but I'm not sure off of the top of my head whether that code maps each feature in a GTF to a
Hi @fnothaft. Thank you for the information. I can try increasing the driver memory and using the ShuffleRegionJoin.
The vcf file I am testing with is
Some searching online led me to this gene annotation file (the previous one mentioned is for an older genome build).
The basic thing I am trying to achieve is just to see how long it takes with ADAM to filter a large set of genotypes based on a gene model. It doesn't need to be this exact gene model, or one this detailed. If you have a gene model that is compatible with the 1000 Genomes data, I would be happy to use that one.
I am only using chr22 because the full data set is much bigger. Queries with just chr22 already take some time. And I have other queries that I've tested on chr22. I'll filter the gene model (gtf file) also for just the chr22 gene annotations, before converting to Feature format.
I will post my results here after testing with the correct genome build and the increased driver memory and/or the ShuffleRegionJoin method.
Increasing the driver memory from 8 to 14 GB (the maximum machine memory) did not help. Still the same errors.
I am now trying to use the
On what should I base the partitionSize? If I set this to 3 (an example that I found in the ADAM test for ShuffleRegionJoin) or 11 (the number of nodes), I get an out of memory error.
If I set partitionSize to 100, I get a lot of small tasks, and processing takes a long time. I did not wait for it to finish.
Do you maybe have an example somewhere, with a public VCF file and a feature file of realistic size, that works, i.e. one that filters a genotypeRDD with BroadcastRegionJoin or ShuffleRegionJoin? I could then check whether I can run that example.
Maybe the issue is just with my gene annotation file...
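On the partitionSize question above: one possible reading, assuming partitionSize is a bin width in nucleotides rather than a partition count (my understanding of the ADAM source of that period, not something confirmed in this thread), is that very small values split the genome into an enormous number of bins. The sketch below only does the arithmetic; `BinCountSketch` and its methods are hypothetical helpers, not part of ADAM.

```scala
object BinCountSketch {
  // Number of fixed-width bins needed to cover one contig (ceiling division).
  def binsForContig(contigLength: Long, binWidth: Long): Long = {
    require(contigLength >= 0L, "contigLength must be non-negative")
    require(binWidth > 0L, "binWidth must be positive")
    (contigLength + binWidth - 1L) / binWidth
  }

  // Total bins over all contigs, given a map of contig name -> length in bases.
  def totalBins(contigLengths: Map[String, Long], binWidth: Long): Long =
    contigLengths.values.map(binsForContig(_, binWidth)).sum
}
```

Taking a chr22 length of 51,304,566 bases (the GRCh37 value, used here as an assumption), `BinCountSketch.binsForContig(51304566L, 3L)` comes to roughly 17 million bins and a width of 100 to roughly half a million, whereas a width of 1,000,000 gives 52. Under that reading, the out of memory errors at 3 and 11 and the flood of small tasks at 100 would both be consistent with a bin width that is far too small.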
I describe something similar here
The gene feature ADAM file came from Ensembl and was filtered via
val features = sc.loadFeatures("Homo_sapiens.GRCh38.89.chr.gff3.gz")
val geneFeatures = features.transform(_.filter(f => f.featureType == "gene"))
geneFeatures.saveAsParquet("Homo_sapiens.GRCh38.89.chr.geneFeatures.adam")