Skip to content
This repository has been archived by the owner on Jan 10, 2019. It is now read-only.

FASTA input files for Garnet #31

Open
agitter opened this issue Sep 4, 2018 · 13 comments
Open

FASTA input files for Garnet #31

agitter opened this issue Sep 4, 2018 · 13 comments
Assignees

Comments

@agitter
Copy link
Contributor

agitter commented Sep 4, 2018

I'm summarizing an issue that @omigueles reported via email. I'm not familiar enough with the Garnet code to resolve it myself. @sgosline suggested it may be related to TAMO.

@zfrenchee do you know who in the Fraenkel lab is actively working with Garnet and may be able to help @omigueles?

A description of the issue from @omigueles:


I was exploring the fasta files in the A549 example and I got a couple of doubts, I would really appreciate if you could help me.

In the example file the fasta headers look like this:

>hg19_chr1_4709051_4709736_+ -1
  1. Playing around, first I eliminated the "space-1" at the end, which causes the program to crash, which I understand since the program expects a header from Galaxy. I eliminated it because I wanted to understand what it means, is it the strand? I thought it was defined as + in this specific header. Besides in the bedfile it looks as if the strand was actually not known...

  2. If I change the "space-1" to "space." ( .) which should be the standard when you do not know the strand, I get completely different results for this example (33 TFs instead of 75 TFs) do you know why this could be happening?

  3. If I try with a toy example in Galaxy I see that the fasta header looks like this (even if you have a "point" in the strand field of the bed file):

>chr1:2-30 (+)

I completely understand that this could be caused by the Galaxy version used, and I can change the headers so that they look as in the example, but then what do I need to put in my header so that Garnet works properly? i.e. Would the next statements be correct?

if I know the strand :

hg19_chr1_2_30_+ +1
hg19_chr1_2_30_- -1

and if I do not know it:

hg19_chr1_2_30_+ .

Thank you for your time and your attention.

(next update:)

I re-checked Galaxy and actually you can get the same format for your header if you choose "character delimited field values" writing "_" . Interestingly the last piece of information is the fourth field of the BED file which is "name". I was also looking through some other scripts in the OmicsIntegrator website and found that in https://github.com/fraenkel-lab/OmicsIntegrator/blob/master/src/chipsequtil/Fasta.py the field after the whitespace in the header is used as Fasta ID. Could it be something like that? The program might be expecting unique IDs and by giving "-1" or "." all the time perhaps is doing something odd.

(next update:)

But my question is the same, how does the "-1" affect the output and why?

To begin with that should not be a "-1" , according to the Galaxy example it should be "peak1", maybe it is a bug in Galaxy. Still the thing is that if I run the example with this "-1" I get 75 TFs just like you do, according to:

https://github.com/fraenkel-lab/OmicsIntegrator/tree/master/tests/integration_test_standard/events_to_genes_with_motifsregression_results_FOREST_INPUT.tsv

but if I put a "." (as in "no name" in the bed file) I get 33 TFs and if I replace "-1" in every header with the "name" field from the bed file, that is "peak1,peak2... peak3108", I get 105 TFs.


This behavior is with OmicsIntegrator at commit 28d9a75, the v0.3.1 release.

One partial resolution is to update our readme to clarify that character delimited field values is required in Galaxy.

I've started to look at the Garnet code to see how the FASTA headers are used. Two instances are:

seq_ids=Fasta.load(fasta_file,key_func=lambda x: x)

where key_func is the identity function so the entire header is used as the id instead of only the second whitespace delimited token.

and

elif 'random' not in vals[0]: #galaxy tools used
genome,chr,low,high,strand=vals[0].split('_')
mid=str(int(low)+((int(high)-int(low))/2))
seq_mids.append(chr+':'+mid)

where it looks like the name for the sequence is not being used.

@alexlenail
Copy link
Contributor

@agitter Sorry to say I don't believe any of us are using GarNet anymore. @iamjli might know more than me.

@agitter
Copy link
Contributor Author

agitter commented Sep 6, 2018

@sgosline do you have any ideas about where to look next in the Garnet code? I could help debug, but I haven't read through many of the Garnet and chipsequtil files before.

@sgosline
Copy link

sgosline commented Sep 6, 2018

Yeah, the Fastq files are read in by the old chip seq util code here. I believe it was last updated by Adam Labadorf but I'm not sure. Is this being maintained at all? I haven't been using Garnet in my research.

@iamjli
Copy link

iamjli commented Sep 28, 2018

Hi all, I have a strong feeling that this issue has something to do with how values are stored in dictionaries. When these dicts are keyed by different values (ie. hg19_chr1_4709051_4709736_+ -1 or hg19_chr1_4709051_4709736_+ .), the order of dict.values() changes.

Could someone familiar with the codebase follow up on this? @sgosline @agitter

@agitter
Copy link
Contributor Author

agitter commented Sep 29, 2018

@iamjli interesting observation. I'm not familiar with the Garnet codebase, but I'll note that dictionary ordering also caused test case failures in the Forest code long ago. In particular, msgsteiner was sensitive to the order in which information was presented.

@iamjli
Copy link

iamjli commented Oct 2, 2018

Is anyone planning to debug this? Otherwise we may need to deprecate this code.

@sgosline
Copy link

sgosline commented Oct 2, 2018

That might be wise, I don't have the bandwidth to debug until mid-November.

@agitter
Copy link
Contributor Author

agitter commented Oct 3, 2018

Is there an alternative workflow that the Fraenkel lab is using to predict transcription factor activities from epigenomic and transcriptional data? If we could offer a replacement, I would be more supportive of deprecating the Garnet code.

If we don't have a replacement, then we would be removing significant functionality only 2 years after the PLOS Comp Bio paper appeared and the same year as @AmandaKedaigle's protocol paper. In that case, I would prefer to wait until @sgosline has time to assess how difficult the debugging would be or recruit someone new to help maintain the code.

@sgosline
Copy link

sgosline commented Nov 4, 2018

Hi all, I'm finally getting to this now as I'm sitting in a lot of talks/planes this week. Sorry for the delay. I'm having an issue getting the code, created #32 to track. It might be my config or the fact that I'm in Europe, not sure.

@sgosline sgosline self-assigned this Nov 5, 2018
@sgosline
Copy link

sgosline commented Nov 7, 2018

I'm still traveling but have narrowed the problem down to the scoring of the fasta files in motif_fsa_scores.py which Anthony Soltis wrote. I think it's either a dictionary key ordering issue or a threading issue. He's allegedly attending the meeting I'm traveling to so will see if he knows.

@agitter
Copy link
Contributor Author

agitter commented Nov 7, 2018

Thanks for the update @sgosline, that's encouraging.

If we need to, I would support making a behavior-breaking change that orders the dict elements as long as we bump the version number. I'd rather err on the side of having a stable implementation than matching past behavior given our limited time for maintenance. I'm not sure whether OrderedDict would be relevant here or if we should just sort the keys, which is a change we made several places in the Forest code to fix dict iteration inconsistencies.

@sgosline
Copy link

sgosline commented Nov 8, 2018

Who is administrator for this github repo? @zfrenchee ? Can you add Anthony Soltis to this project so he can participate?

@alexlenail
Copy link
Contributor

alexlenail commented Nov 8, 2018 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants