FASTA input files for Garnet #31

agitter · 2018-09-04T11:41:07Z

I'm summarizing an issue that @omigueles reported via email. I'm not familiar enough with the Garnet code to resolve it myself. @sgosline suggested it may be related to TAMO.

@zfrenchee do you know who in the Fraenkel lab is actively working with Garnet and may be able to help @omigueles?

A description of the issue from @omigueles:

I was exploring the fasta files in the A549 example and I got a couple of doubts, I would really appreciate if you could help me.

In the example file the fasta headers look like this:

>hg19_chr1_4709051_4709736_+ -1

Playing around, first I eliminated the "space-1" at the end, which causes the program to crash, which I understand since the program expects a header from Galaxy. I eliminated it because I wanted to understand what it means, is it the strand? I thought it was defined as + in this specific header. Besides in the bedfile it looks as if the strand was actually not known...
If I change the "space-1" to "space." ( .) which should be the standard when you do not know the strand, I get completely different results for this example (33 TFs instead of 75 TFs) do you know why this could be happening?
If I try with a toy example in Galaxy I see that the fasta header looks like this (even if you have a "point" in the strand field of the bed file):

>chr1:2-30 (+)

I completely understand that this could be caused by the Galaxy version used, and I can change the headers so that they look as in the example, but then what do I need to put in my header so that Garnet works properly? i.e. Would the next statements be correct?

if I know the strand :

hg19_chr1_2_30_+ +1
hg19_chr1_2_30_- -1

and if I do not know it:

hg19_chr1_2_30_+ .

Thank you for your time and your attention.

(next update:)

I re-checked Galaxy and actually you can get the same format for your header if you choose "character delimited field values" writing "_" . Interestingly the last piece of information is the fourth field of the BED file which is "name". I was also looking through some other scripts in the OmicsIntegrator website and found that in https://github.com/fraenkel-lab/OmicsIntegrator/blob/master/src/chipsequtil/Fasta.py the field after the whitespace in the header is used as Fasta ID. Could it be something like that? The program might be expecting unique IDs and by giving "-1" or "." all the time perhaps is doing something odd.

(next update:)

But my question is the same, how does the "-1" affect the output and why?

To begin with that should not be a "-1" , according to the Galaxy example it should be "peak1", maybe it is a bug in Galaxy. Still the thing is that if I run the example with this "-1" I get 75 TFs just like you do, according to:

https://github.com/fraenkel-lab/OmicsIntegrator/tree/master/tests/integration_test_standard/events_to_genes_with_motifsregression_results_FOREST_INPUT.tsv

but if I put a "." (as in "no name" in the bed file) I get 33 TFs and if I replace "-1" in every header with the "name" field from the bed file, that is "peak1,peak2... peak3108", I get 105 TFs.

This behavior is with OmicsIntegrator at commit 28d9a75, the v0.3.1 release.

One partial resolution is to update our readme to clarify that character delimited field values is required in Galaxy.

I've started to look at the Garnet code to see how the FASTA headers are used. Two instances are:

OmicsIntegrator/scripts/get_window_binding_matrix.py

Line 23 in 08b5548

seq_ids=Fasta.load(fasta_file,key_func=lambda x: x)

where key_func is the identity function so the entire header is used as the id instead of only the second whitespace delimited token.

and

OmicsIntegrator/scripts/get_window_binding_matrix.py

Lines 39 to 42 in 08b5548

    
           elif 'random' not in vals[0]: #galaxy tools used 
        
               genome,chr,low,high,strand=vals[0].split('_') 
        
               mid=str(int(low)+((int(high)-int(low))/2)) 
        
               seq_mids.append(chr+':'+mid)

where it looks like the name for the sequence is not being used.

The text was updated successfully, but these errors were encountered:

alexlenail · 2018-09-04T13:04:40Z

@agitter Sorry to say I don't believe any of us are using GarNet anymore. @iamjli might know more than me.

agitter · 2018-09-06T11:20:59Z

@sgosline do you have any ideas about where to look next in the Garnet code? I could help debug, but I haven't read through many of the Garnet and chipsequtil files before.

sgosline · 2018-09-06T16:33:12Z

Yeah, the Fastq files are read in by the old chip seq util code here. I believe it was last updated by Adam Labadorf but I'm not sure. Is this being maintained at all? I haven't been using Garnet in my research.

iamjli · 2018-09-28T17:30:53Z

Hi all, I have a strong feeling that this issue has something to do with how values are stored in dictionaries. When these dicts are keyed by different values (ie. hg19_chr1_4709051_4709736_+ -1 or hg19_chr1_4709051_4709736_+ .), the order of dict.values() changes.

Could someone familiar with the codebase follow up on this? @sgosline @agitter

agitter · 2018-09-29T12:01:14Z

@iamjli interesting observation. I'm not familiar with the Garnet codebase, but I'll note that dictionary ordering also caused test case failures in the Forest code long ago. In particular, msgsteiner was sensitive to the order in which information was presented.

iamjli · 2018-10-02T16:41:48Z

Is anyone planning to debug this? Otherwise we may need to deprecate this code.

sgosline · 2018-10-02T16:54:35Z

That might be wise, I don't have the bandwidth to debug until mid-November.

agitter · 2018-10-03T11:18:09Z

Is there an alternative workflow that the Fraenkel lab is using to predict transcription factor activities from epigenomic and transcriptional data? If we could offer a replacement, I would be more supportive of deprecating the Garnet code.

If we don't have a replacement, then we would be removing significant functionality only 2 years after the PLOS Comp Bio paper appeared and the same year as @AmandaKedaigle's protocol paper. In that case, I would prefer to wait until @sgosline has time to assess how difficult the debugging would be or recruit someone new to help maintain the code.

sgosline · 2018-11-04T12:28:43Z

Hi all, I'm finally getting to this now as I'm sitting in a lot of talks/planes this week. Sorry for the delay. I'm having an issue getting the code, created #32 to track. It might be my config or the fact that I'm in Europe, not sure.

sgosline · 2018-11-07T17:03:05Z

I'm still traveling but have narrowed the problem down to the scoring of the fasta files in motif_fsa_scores.py which Anthony Soltis wrote. I think it's either a dictionary key ordering issue or a threading issue. He's allegedly attending the meeting I'm traveling to so will see if he knows.

agitter · 2018-11-07T19:04:35Z

Thanks for the update @sgosline, that's encouraging.

If we need to, I would support making a behavior-breaking change that orders the dict elements as long as we bump the version number. I'd rather err on the side of having a stable implementation than matching past behavior given our limited time for maintenance. I'm not sure whether OrderedDict would be relevant here or if we should just sort the keys, which is a change we made several places in the Forest code to fix dict iteration inconsistencies.

sgosline · 2018-11-08T13:58:11Z

Who is administrator for this github repo? @zfrenchee ? Can you add Anthony Soltis to this project so he can participate?

alexlenail · 2018-11-08T14:06:44Z

He's invited. Anthony, check out: https://github.com/fraenkel-lab/OmicsIntegrator/invitations

…

On Thu, Nov 8, 2018 at 8:58 AM Sara JC Gosline ***@***.***> wrote: Who is administrator for this github repo? @zfrenchee <https://github.com/zfrenchee> ? Can you add Anthony Soltis to this project so he can participate? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#31 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACojfR3NrEAwSbsu81nu0PDSQX2_5YlLks5utDh0gaJpZM4WY0Rp> .

-- Alex Lenail Tufts University '16

sgosline self-assigned this Nov 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FASTA input files for Garnet #31

FASTA input files for Garnet #31

agitter commented Sep 4, 2018

alexlenail commented Sep 4, 2018

agitter commented Sep 6, 2018

sgosline commented Sep 6, 2018

iamjli commented Sep 28, 2018

agitter commented Sep 29, 2018

iamjli commented Oct 2, 2018

sgosline commented Oct 2, 2018

agitter commented Oct 3, 2018

sgosline commented Nov 4, 2018 •

edited

Loading

sgosline commented Nov 7, 2018

agitter commented Nov 7, 2018

sgosline commented Nov 8, 2018

alexlenail commented Nov 8, 2018 via email

FASTA input files for Garnet #31

FASTA input files for Garnet #31

Comments

agitter commented Sep 4, 2018

alexlenail commented Sep 4, 2018

agitter commented Sep 6, 2018

sgosline commented Sep 6, 2018

iamjli commented Sep 28, 2018

agitter commented Sep 29, 2018

iamjli commented Oct 2, 2018

sgosline commented Oct 2, 2018

agitter commented Oct 3, 2018

sgosline commented Nov 4, 2018 • edited Loading

sgosline commented Nov 7, 2018

agitter commented Nov 7, 2018

sgosline commented Nov 8, 2018

alexlenail commented Nov 8, 2018 via email

sgosline commented Nov 4, 2018 •

edited

Loading