AssertionError: assert len(merged_df) == chip_df_nb_bins #31
Comments
(Just thinking out loud below.) This is the code:

    def _merge_chip_and_input(chip_df, input_df):
        chip_df_nb_bins = len(chip_df)
        merged_df = chip_df.merge(input_df,
                                  how="left",
                                  on=["Chromosome", "Bin"],
                                  suffixes=[" ChIP", " Input"])
        merged_df = merged_df[["Chromosome", "Bin", "Count ChIP", "Count Input"]]
        merged_df.columns = ["Chromosome", "Bin", "ChIP", "Input"]
        merged_df = merged_df.fillna(0)
        assert len(merged_df) == chip_df_nb_bins
        return merged_df

It just left-merges (a left outer join in SQL-speak) the chip and input dataframes. This means that it should keep all the bins (e.g. |
I suspect this is a pathological edge-case, just cannot understand which. I will add a debug mode with much more info printed to the screen so we can track down the error. Could you please add the command line args you used? |
@balwierz The new function has an assertion message that indicates the input to the function and the resulting output. I hope it will be enough to debug the error. I have no guess about what the reason for the error might be, but I hope it is not that the files represent different chromosomes. |
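For readers following along, an assertion that reports its inputs and output might look roughly like the sketch below. This is only an illustration, not the code that shipped in epic: the function name `merge_with_diagnostics` and the message wording are hypothetical, and the column names are taken from the snippet quoted earlier in the thread.

```python
import pandas as pd


def merge_with_diagnostics(chip_df, input_df):
    """Left-merge on (Chromosome, Bin) and fail loudly if the row count changes.

    Hypothetical helper for illustration only; not epic's actual function.
    """
    merged_df = chip_df.merge(input_df, how="left", on=["Chromosome", "Bin"],
                              suffixes=(" ChIP", " Input"))
    # The message shows both inputs and the output so the failure can be
    # diagnosed from the traceback alone.
    assert len(merged_df) == len(chip_df), (
        "Merge changed the number of bins: {} ChIP rows in, {} rows out.\n"
        "ChIP head:\n{}\nInput head:\n{}\nMerged head:\n{}".format(
            len(chip_df), len(merged_df),
            chip_df.head(), input_df.head(), merged_df.head()))
    return merged_df
```

With a message like this, a user can paste the traceback into the issue without having to share the underlying files.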
Perhaps you could run |
@endrebak : The data is on one chromosome only. So both for ChIP and Input the result is |
I'll try to make single chromosome files myself and see if I can reproduce this. Anyways, could you please run epic If nothing else works, could you possibly send me the files privately? Then it would be simple to debug. Edit: wrote 0.1.15 to begin with. |
Oh, if the files are one chromo only and paired end, you have nothing to gain from using multiple cores, sorry :) |
Come to think of it, the statistics in epic (and SICER) are made for whole genomes. I can fix this, but why do you want to run the software only on a single chromosome? (Would be a lot of work for probably little gain, so I would need many others to request the same thing, for a good reason.) I could not reproduce your bug with one-chromosome files. Would you be able to send me your files with a dropbox-link or something? My e-mail addy is endrebak85 gmail.com |
I ran it on a single chromosome for testing, because I have already tried to run epic genome-wide twice, and in both cases I got crashes (and reported them). This time I didn't want to wait several hours until all the huge bam files are processed, especially since I needed to disable parallel processing. There are always some chromosomes with no signal (e.g. chrM) because of some upstream filtering, so this shouldn't make a difference. Here is the result of running a single chromosome with version 0.1.15. I am not sure what the data frame should look like, but it looks strange:
|
Now I see why you only want to run it on a single chromosome :) I'll look into this more now. |
Eureka! Thanks for the help. As you can see above, there are multiple entries for the bins in the chip df. These should be summed into one row before merging the chip and input. I'll fix this now. Sorry about this, I had no real paired end data to try my code on. |
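To make the diagnosis concrete, here is a minimal sketch on toy data. It assumes each frame has `Chromosome`, `Bin`, and `Count` columns (inferred from the merge code quoted earlier) and uses a hypothetical helper, `sum_duplicate_bins`; it is not the fix as implemented in epic. A left merge only preserves the left-hand row count when the join keys on the right are unique, so repeated bins on both sides multiply rows.

```python
import pandas as pd


def sum_duplicate_bins(df):
    """Collapse rows sharing (Chromosome, Bin) into one row by summing Count.

    Hypothetical helper for illustration only; not epic's actual code.
    """
    return df.groupby(["Chromosome", "Bin"], as_index=False)["Count"].sum()


# Toy data (made up): bin 200 appears twice in both frames.
chip_df = pd.DataFrame({"Chromosome": ["chr1", "chr1", "chr1"],
                        "Bin": [0, 200, 200],
                        "Count": [3, 1, 2]})
input_df = pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                         "Bin": [200, 200],
                         "Count": [4, 1]})

# Naive merge: every duplicated ChIP row pairs with every duplicated input
# row, so the result has more rows than chip_df and the assertion would fail.
naive = chip_df.merge(input_df, how="left", on=["Chromosome", "Bin"],
                      suffixes=(" ChIP", " Input"))

# Summing duplicates first gives one row per bin on each side, so the merged
# frame has exactly as many rows as the deduplicated ChIP frame.
chip_unique = sum_duplicate_bins(chip_df)
input_unique = sum_duplicate_bins(input_df)
fixed = chip_unique.merge(input_unique, how="left", on=["Chromosome", "Bin"],
                          suffixes=(" ChIP", " Input")).fillna(0)
```

Passing `validate="many_to_one"` to `DataFrame.merge` is another way to catch this early: pandas then raises a `MergeError` when the right-hand keys are not unique, instead of silently producing extra rows.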
Ps. the Btw, if you could tell me what you need |
With If my fix does not work, I know another way to fix the bug, I just wanted to try the simplest one first. |
Now If your problems are not fixed, please reopen this issue. And perhaps you could try using multiple cores again? It might be that joblib stopped due to the Ps. Ps.ps. epic seems really, really slow on your data. When I run epic it takes at most 5-10 minutes, even with many chip/control files. If your files are not huge, you might want to try unzipping them first, dunno if that matters. If they are huge, you might want to do some QC/preprocessing. Added you to the contributors list for helping me, btw. |
Nope. It didn't help
|
Okay, but I know another fix that is almost sure to fix it. Stay tuned. |
Hi -- In If the backup fix is used you will see the line "Making duplicated bins unique by summing them." Please tell me if you see it :) Btw. The reason the bug happened - in addition to my code - might be that you have mates a very long distance apart. This might be bad since epic uses the midpoint, which could be in an uninteresting region. See this question for my thinking: ChIP-Seq: When reducing paired end reads to one coordinate, should I use the midpoint? |
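The midpoint concern can be seen with a little arithmetic. The sketch below is an illustration under stated assumptions, not epic's implementation: the helper `midpoint_bin` is hypothetical, the coordinates are made up, and the bin size of 200 is an assumed value.

```python
def midpoint_bin(start, end, bin_size=200):
    """Return the start coordinate of the bin containing the fragment midpoint.

    Hypothetical helper; bin_size=200 is an assumption for this example.
    """
    midpoint = (start + end) // 2
    return midpoint - midpoint % bin_size


# Mates close together: the midpoint lands in the same bin as the first mate.
close_pair = midpoint_bin(1000, 1300)

# Mates far apart: the midpoint lands in a bin far from either mate,
# possibly in a region with no real signal.
distant_pair = midpoint_bin(1000, 501000)
```

Here `close_pair` is 1000 while `distant_pair` is 251000, roughly a quarter of a megabase away from both mates.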
If the backup is used, you will get an importerror. Will fix this today. |
|
Sb kindly lent me a pair of pe chip seq files so I'll see if it works now. |
I was able to reproduce the error, but epic now works with one core. I'll open a new issue for the multiprocessing bug. See #33 |
(from #30).