Bioinformatics topic ideas #7

mbonsma · 2015-06-15T18:54:57Z

Suggestions and ideas for topics we'd like to see covered at some point. This thread is for bioinformatics-related topics, or things that people working with genetic data might find useful. Posting a suggestion does not lock you in to presenting on that topic.

ricardoharripaul · 2015-06-15T18:56:26Z

Variant Calling
Machine learning with Genetic Data
Building Biological Databases
Advanced Statistical Tools would be good for everyone

mbonsma · 2015-06-15T18:56:45Z

SeqIO (http://biopython.org/wiki/SeqIO) in BioPerl or BioPython. My vote's for Python, but maybe it could be taught in a way that's applicable to both if people are interested in both?
Edit: Here's a possible lesson plan.

linamnt · 2015-06-15T18:57:46Z

A bit more specific to people who do network analysis, but tools such as igraph in R or NetworkX in Python might be useful for some?

ricardoharripaul · 2015-06-15T18:58:14Z

Yeah, R also has a package for SeqIO.

I wrote some code in Python that can work with BAM files and FASTQ/FASTA
files. I used it to extract and filter certain types of reads. You
can also use it for motif finding.
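
As a rough illustration of that kind of read filtering (not the actual code mentioned above), here is a minimal standard-library sketch. It assumes plain four-line FASTQ records; the function names `read_fastq` and `filter_reads`, the example reads, and the `TATAA` motif are all made up for the example.

```python
import io
import re

def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a blank line, which this sketch treats as the end)
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def filter_reads(records, motif, min_len=0):
    """Keep reads at least min_len long whose sequence matches the motif regex."""
    pattern = re.compile(motif)
    for name, seq, qual in records:
        if len(seq) >= min_len and pattern.search(seq):
            yield name, seq, qual

# Hypothetical in-memory FASTQ data standing in for a real file
example = io.StringIO(
    "@read1\nACGTTATAAGG\n+\nIIIIIIIIIII\n"
    "@read2\nGGGCCC\n+\nIIIIII\n"
    "@read3\nTTTATAAA\n+\nIIIIIIII\n"
)
kept = [name for name, _, _ in filter_reads(read_fastq(example), "TATAA", min_len=8)]
print(kept)
```

For real data, a library parser such as Biopython's SeqIO would handle edge cases that this sketch ignores.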

mbonsma · 2015-06-15T19:00:44Z

This is so great. Ideas! If we even had a session where everyone who works with fasta files just talks about their life, I would be so happy. Haha.

MattStata · 2015-06-18T00:13:50Z

I've been using NetworkX in Python for a while now, to do a variety of things, particularly focused around the use of the MCL algorithm for network clustering. I'd love to learn more about visualizing networks using NetworkX + matplotlib if anyone has any expertise in that? Or just matplotlib in general, really.

MattStata · 2015-06-18T00:40:00Z

Also, I could present on de novo transcriptome assembly or phylogenomics (the two big areas I've been devoting time to lately), as well as general stuff like BLAST and variant tools, sequence alignment, building gene trees, etc, if there's interest in beginner type stuff.

ricardoharripaul · 2015-06-18T02:52:26Z

I have experience in matplotlib. It is quite nice to use a script for generating figures because you can change things systematically and easily for different journals.

ricardoharripaul · 2015-06-18T02:53:03Z

As well,

I can do variant calling, bisulfite alignment, and reference-guided alignment. I seem to be doing that a lot lately.

MattStata · 2015-06-18T02:58:09Z

Oh great! We should chat. I could use some recommendations for variant calling, but in a highly specific context.

ricardoharripaul · 2015-06-18T03:00:15Z

Sure.

QuLogic · 2015-06-18T03:06:00Z

@MattStata @ricardoharripaul #2 ...

ricardoharripaul · 2015-06-18T15:21:28Z

Hi Matt,

Did you have any questions? I am not sure how you want to communicate.

MattStata · 2015-06-19T01:55:36Z

Well basically, I'm looking for a program that can use RNA-seq reads against a set of coding sequences I've assembled and identify putative SNPs. Do you have any recommendations? I've never used any SNP-calling software before so I'm not really sure what's out there and what the required inputs for most are -- I would assume the majority map genomic reads against a genome, rather than RNA-seq against CDS.

As for communicating, I think here is probably fine, as long as this doesn't turn into a really lengthy side-discussion and totally derail the thread.

ricardoharripaul · 2015-06-19T02:03:33Z

Hi Matt,

So you already know your SNPs and have their sequence or position? What
are you mapping against? There is no reference and nothing similar?

It makes a difference. Your problem reminds me of a targeted sequencing
problem.

Ricardo

MattStata · 2015-06-20T15:02:22Z

I have no reference. I have de novo transcriptome assemblies for two plant species, from which I've extracted orthologous pairs of coding sequences. I would now like to use the original reads from three different individuals to get some idea of the genetic diversity, and in particular the degree of heterozygosity, so we can decide whether we need to self the plants several times to reduce heterozygosity before starting a genome sequencing project. I could write something to do this using BLAT results for the read mapping, but an existing tool would save me some trouble. I imagine there must be something that is either intended for this or flexible enough to use in this situation?
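
To make the heterozygosity question concrete, here is a minimal sketch of a naive per-site call from read counts. It assumes the reads are already mapped and summarized as per-site base counts; the thresholds `min_depth` and `min_minor_frac` are illustrative assumptions rather than recommended values, and with RNA-seq data unequal expression of the two alleles could skew the minor-allele fraction.

```python
def is_heterozygous(counts, min_depth=10, min_minor_frac=0.25):
    """Call a site heterozygous if the second most common base is well supported.

    counts maps base -> number of reads supporting it at one site.
    Returns None when coverage is too low to make a call.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return None
    ranked = sorted(counts.values(), reverse=True)
    minor = ranked[1] if len(ranked) > 1 else 0
    return minor / depth >= min_minor_frac

def heterozygosity(site_counts, **kwargs):
    """Fraction of callable sites that look heterozygous."""
    calls = [is_heterozygous(c, **kwargs) for c in site_counts]
    callable_sites = [c for c in calls if c is not None]
    if not callable_sites:
        return 0.0
    return sum(callable_sites) / len(callable_sites)

# Hypothetical per-site counts for one individual
sites = [
    {"A": 20},          # one allele only
    {"A": 11, "G": 9},  # two well-supported alleles
    {"C": 19, "T": 1},  # minor base is probably a sequencing error
    {"G": 3},           # too shallow, skipped
]
print(heterozygosity(sites))
```

A dedicated variant caller models base qualities and genotype likelihoods, which this simple count threshold does not.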

ricardoharripaul · 2015-06-20T15:39:51Z

Hi Matt,

I am pretty sure BLAT would be too slow. Did you use Trinity or ABySS for
de novo assembly? I am wondering if you could use something like Bowtie, with
the scaffolds from the de novo assembly as the reference, and map the reads that way.

Have you looked into PAGAN?

https://code.google.com/p/pagan-msa/wiki/PAGAN?tm=6

MattStata · 2015-06-20T18:48:51Z

BLAT actually works quite well for mapping reads, and can be really fast with the right settings, especially when run in parallel with GNU parallel. I've used it quite a bit for that.

My pipeline, which I'm still refining, is something like this:

- Multiple assemblies (Trinity, IDBA, SOAPdenovo-Trans) combined
- Predict ORFs with EMBOSS "getorf" and take the three longest per assembled transcript, above a certain minimum length threshold, using a Python script (this of course introduces a lot of spurious ORFs, but they're filtered out over the next steps)
- Remove duplicate ORFs with another Python script
- Merge highly similar ORFs and take a single representative using CD-HIT in order to reduce redundancy
- BLAST the ORFs for my two species against each other and take reciprocal best hits to further reduce redundancy
- BLAST what remains against selected genomes in Phytozome v10.2, and eliminate anything without a good match, to remove totally spurious ORFs that might remain
- Cluster the results of the two BLAST comparisons using MCL in order to group my assemblies with their orthologs in other species for functional annotation
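
The reciprocal-best-hit step in the list above can be sketched roughly as follows. This is not the pipeline's actual script: it assumes tab-separated BLAST output in the default 12-column tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12), and the helper `row` and the IDs `A1`, `B1`, and so on are invented for the example.

```python
def best_hits(blast_lines):
    """Map each query to its highest-bitscore subject from tabular BLAST output."""
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

def row(query, subject, bitscore):
    # Hypothetical helper: pad the nine unused columns with placeholders
    return "\t".join([query, subject] + ["0"] * 9 + [str(bitscore)])

a_vs_b = [row("A1", "B1", 200), row("A1", "B2", 150), row("A2", "B2", 180)]
b_vs_a = [row("B1", "A1", 190), row("B2", "A1", 160), row("B2", "A2", 120)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here `A2` is dropped because its best hit `B2` prefers `A1`, which is the redundancy filter the step describes. Ties are broken by first occurrence in this sketch.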

Basically this is all aimed at getting a good set of pairwise orthologs with predicted function, so that I can then do further downstream analysis with interspecific hybrids of these two species. But as I mentioned, we also intend to eventually sequence the genomes of the two parent species and so would like to get a rough idea of the degree of heterozygosity so as to decide whether we need to self a few more times before starting sequencing. So I'd like to see what SNPs exist in my coding regions (particularly percentage of SNPs at synonymous sites, which should be a reasonable approximation of SNPs for the other neutral parts of the genome) for each species.
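
Whether a given SNP falls at a synonymous site can be checked by translating the reference and alternate codons. A minimal sketch, assuming the standard genetic code, an in-frame coding sequence that starts at the first codon position, and 0-based SNP coordinates; the name `classify_snp` and the toy sequence are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(cds, position, alt_base):
    """Return 'synonymous' or 'nonsynonymous' for a SNP in an in-frame CDS.

    position is 0-based within cds; cds is assumed to start in frame.
    """
    offset = position % 3
    start = position - offset
    ref_codon = cds[start:start + 3].upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "nonsynonymous"

cds = "ATGGCTGAA"  # Met-Ala-Glu
print(classify_snp(cds, 5, "C"))  # GCT -> GCC, both Ala
print(classify_snp(cds, 6, "C"))  # GAA -> CAA, Glu -> Gln
```

Counting the 'synonymous' calls over all SNPs in a species gives the percentage described above.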

PAGAN sounds interesting, but I don't see how it fits my problem -- were you suggesting it just as an alternative to BLAT?

ricardoharripaul · 2015-06-21T16:28:46Z

I was suggesting PAGAN instead of BLAT.

Have you seen this paper? It implements a statistical approach to estimate
heterozygosity.

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0020041

Is it possible to get SNP frequencies from your data?

https://books.google.ca/books?id=hn9ivqTpSKAC&pg=PA371&lpg=PA371&dq=computationally+find+heterozygosity&source=bl&ots=iVNaa3HwUL&sig=e0jy5i7GY9v7azbFoJee_uwa8Nw&hl=en&sa=X&ei=jOSGVd-1AYiUyQS355TIBg&ved=0CEUQ6AEwBQ#v=onepage&q=computationally%20find%20heterozygosity&f=false

Ricardo

lwjohnst86 closed this as completed Aug 4, 2016
