AssertionError: assert len(merged_df) == chip_df_nb_bins #31

Closed
endrebak opened this issue Aug 9, 2016 · 21 comments
endrebak commented Aug 9, 2016

Traceback (most recent call last):
  File "/usr/local/bin/epic", line 165, in <module>
    run_epic(args)
  File "/usr/local/lib/python2.7/dist-packages/epic/run/run_epic.py", line 42, in run_epic
    args.number_cores)
  File "/usr/local/lib/python2.7/dist-packages/epic/utils/helper_functions.py", line 37, in merge_chip_and_input
    for chip_df, input_df in zip(chip_dfs, input_dfs))
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 754, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 604, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 567, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python2.7/dist-packages/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/usr/local/lib/python2.7/dist-packages/joblib/_parallel_backends.py", line 322, in __init__
    self.results = batch()
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 127, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/lib/python2.7/dist-packages/epic/utils/helper_functions.py", line 24, in _merge_chip_and_input
    assert len(merged_df) == chip_df_nb_bins
AssertionError

(from #30).

endrebak commented Aug 9, 2016

(Just thinking out loud below):

This is the code:

def _merge_chip_and_input(chip_df, input_df):

    chip_df_nb_bins = len(chip_df)
    merged_df = chip_df.merge(input_df,
                              how="left",
                              on=["Chromosome", "Bin"],
                              suffixes=[" ChIP", " Input"])
    merged_df = merged_df[["Chromosome", "Bin", "Count ChIP", "Count Input"]]
    merged_df.columns = ["Chromosome", "Bin", "ChIP", "Input"]

    merged_df = merged_df.fillna(0)

    assert len(merged_df) == chip_df_nb_bins

    return merged_df

It just left-merges (a left outer join, in SQL-speak) the chip and input dataframes. This means it should keep all the bins in the chip dataframe (e.g. chr1 200, chrX 496000) and attach any input data in the same bins, discarding the rest of the input. By that description the assertion should always hold, since the chip bins are always kept.
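
One caveat to this reasoning: a left merge only preserves the left row count when the join keys on the right are unique. If either dataframe contains duplicate (Chromosome, Bin) rows, each left row is emitted once per matching right row, so the result grows. A toy pandas sketch (made-up data, not from the actual files):

import pandas as pd

chip_df = pd.DataFrame({"Chromosome": ["chr1", "chr1"],
                        "Bin": [400, 800],
                        "Count": [1, 3]})
# Note the duplicated (chr1, 800) key in the input dataframe.
input_df = pd.DataFrame({"Chromosome": ["chr1", "chr1", "chr1"],
                         "Bin": [800, 800, 1200],
                         "Count": [2, 5, 7]})

merged_df = chip_df.merge(input_df,
                          how="left",
                          on=["Chromosome", "Bin"],
                          suffixes=[" ChIP", " Input"])
# Two chip rows become three merged rows: bin 800 matched two input rows.
assert len(merged_df) == len(chip_df)  # fails: 3 != 2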

endrebak commented Aug 9, 2016

I suspect this is a pathological edge case, I just cannot tell which one yet. I will add a debug mode that prints much more info to the screen so we can track down the error.

Could you please add the command line args you used?

endrebak commented Aug 9, 2016

@balwierz The new function has an assertion message that indicates the input to the function and the resulting output. I hope it will be enough to debug the error.

I have no guess about what the reason for the error might be, but I hope it is not that the files represent different chromosomes.

endrebak commented Aug 9, 2016

Perhaps you could run bzcat <yourfile> | cut -f 1 | sort -u on your files and report the output back to me? (gzcat might also work.) Thanks for helping me debug this error.

balwierz commented Aug 9, 2016

@endrebak : The data is on one chromosome only. So both for ChIP and Input the result is chr3.

endrebak commented Aug 9, 2016

I'll try to make single chromosome files myself and see if I can reproduce this.

Anyway, could you please run epic 0.1.5 on your data and send me the output of the new error? Thanks.

If nothing else works, could you possibly send me the files privately? Then it would be simple to debug.

Edit: wrote 0.1.15 to begin with.

endrebak commented Aug 9, 2016

Oh, if the files are one chromosome only and paired-end, you have nothing to gain from using multiple cores, sorry :)

endrebak commented Aug 9, 2016

Come to think of it, the statistics in epic (and SICER) are made for whole genomes.

I can fix this, but why do you want to run the software on only a single chromosome? (It would be a lot of work for probably little gain, so I would need many others to request the same thing, for a good reason.)

I could not reproduce your bug with one-chromosome files. Would you be able to send me your files with a Dropbox link or something? My e-mail address is endrebak85 gmail.com

balwierz commented Aug 9, 2016

I ran it on a single chromosome for testing, because I have already tried running epic genome-wide twice, and both times I got crashes (and reported them). This time I didn't want to wait several hours for all the huge bam files to be processed, especially as I needed to disable parallel processing. There are always some chromosomes with no signal (e.g. chrM) because of upstream filtering, so this shouldn't make a difference.

Here is the result of running a single chromosome with version 0.1.15. I am not sure how the data frame should look, but it looks strange:

 # epic --treatment H3K27me3.chr3.bedpe.bz2 --control input.chr3.bedpe.bz2 --genome danRer7 --paired-end --store-matrix H3K27me3.matrix (File: epic, Log level: INFO, Time: Tue, 09 Aug 2016 13:52:17 )
Using paired end so setting readlength to 100. (File: epic, Log level: INFO, Time: Tue, 09 Aug 2016 13:52:18 )
Using an effective genome fraction of 0.9144196231594702. (File: genomes, Log level: INFO, Time: Tue, 09 Aug 2016 13:52:18 )
Binning H3K27me3.chr3.bedpe.bz2 (File: run_epic, Log level: INFO, Time: Tue, 09 Aug 2016 13:52:18 )
Binning chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, M (File: count_reads_in_windows, Log level: INFO, Time: Tue, 09 Aug 2016 13:52:18 )
Binning input.chr3.bedpe.bz2 (File: run_epic, Log level: INFO, Time: Tue, 09 Aug 2016 14:06:33 )
Binning chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, M (File: count_reads_in_windows, Log level: INFO, Time: Tue, 09 Aug 2016 14:06:33 )
Merging ChIP and Input data. (File: helper_functions, Log level: INFO, Time: Tue, 09 Aug 2016 14:21:53 )
# epic --treatment H3K27me3.chr3.bedpe.bz2 --control input.chr3.bedpe.bz2 --genome danRer7 --paired-end --store-matrix H3K27me3..matrix
Traceback (most recent call last):
  File "/usr/local/bin/epic", line 165, in <module>
    run_epic(args)
  File "/usr/local/lib/python2.7/dist-packages/epic/run/run_epic.py", line 42, in run_epic
    args.number_cores)
  File "/usr/local/lib/python2.7/dist-packages/epic/utils/helper_functions.py", line 51, in merge_chip_and_input
    for chip_df, input_df in zip(chip_dfs, input_dfs))
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 754, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 604, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 567, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python2.7/dist-packages/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/usr/local/lib/python2.7/dist-packages/joblib/_parallel_backends.py", line 322, in __init__
    self.results = batch()
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 127, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/lib/python2.7/dist-packages/epic/utils/helper_functions.py", line 38, in _merge_chip_and_input
    assert len(merged_df) == chip_df_nb_bins, assertion_message
AssertionError: Wrong number of rows after merging ChIP/Input.
ChIP bins: 719137
Input bins: 1122376
Head of ChIP df: 
 Chromosome Bin Count
0 chr3 400 1
1 chr3 800 3
2 chr3 600 1
3 chr3 800 5
4 chr3 1000 7

Head of Input df: 
 Chromosome Bin Count
0 chr3 0 4
1 chr3 600 4
2 chr3 800 10
3 chr3 1000 1
4 chr3 800 3

Tail of ChIP df: 
 Chromosome Bin Count
719132 chr3 63268800 4
719133 chr3 63268600 1
719134 chr3 63268800 39
719135 chr3 63268600 1
719136 chr3 63268800 60

Tail of Input df: 
 Chromosome Bin Count
1122371 chr3 63268800 22
1122372 chr3 63268600 2
1122373 chr3 63268800 2
1122374 chr3 63268600 1
1122375 chr3 63268800 24

Number of bins in merged df: 
3386805
Head of merged df: 
 Chromosome Bin Count
0 chr3 0 4
1 chr3 600 4
2 chr3 800 10
3 chr3 1000 1
4 chr3 800 3

Tail of merged df: 
 Chromosome Bin Count
1122371 chr3 63268800 22
1122372 chr3 63268600 2
1122373 chr3 63268800 2
1122374 chr3 63268600 1
1122375 chr3 63268800 24

@endrebak

Now I see why you only want to run it on a single chromosome :)

I'll look into this more now.

endrebak commented Aug 10, 2016

Eureka! Thanks for the help.

As you can see above, there are multiple entries for the same bins in the chip df. These should be summed into one row before merging the chip and input. I'll fix this now.
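
A minimal sketch of the fix (hypothetical; not necessarily the exact code that will land in epic): collapse duplicate (Chromosome, Bin) rows by summing their counts before merging:

def sum_duplicate_bins(df):
    # Collapse repeated (Chromosome, Bin) rows into one row per bin by
    # summing their counts, so the later left merge cannot multiply rows.
    return df.groupby(["Chromosome", "Bin"])["Count"].sum().reset_index()

Applied to both the chip and input dataframes before _merge_chip_and_input, the assertion above should hold again.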

Sorry about this, I had no real paired-end data to try my code on.

endrebak commented Aug 10, 2016

PS: the --store-matrix option does not work for paired-end data yet. I'll try to fix that afterwards.

Btw, if you could tell me what you need --store-matrix for, that would be helpful to me. I use it to run linear models on the chip regions I find (with great success) :)

endrebak commented Aug 10, 2016

With 0.1.6 I have tried to fix this with the smallest possible surgery. It seems I had made an off-by-one error when using unix sort (its fields are 1-indexed, while Python uses 0-indexing).
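
To illustrate the pitfall with a made-up snippet (not epic's actual code, and hypothetical file names): passing a 0-based Python column index straight to sort's -k flag sorts the wrong field; it has to be shifted by one first.

import subprocess

bin_col = 1  # 0-based index of the Bin column, Python-style
# unix sort numbers fields from 1, so shift the Python index by one:
key = "-k{0},{0}n".format(bin_col + 1)
subprocess.check_call(["sort", key, "bins.bed", "-o", "bins.sorted.bed"])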

If my fix does not work, I know another way to fix the bug; I just wanted to try the simplest one first.

endrebak commented Aug 10, 2016

Now 0.1.6 is out on PyPI. In it I have fixed the paired-end problems and also allowed --store-matrix for paired-end files.

If your problems are not fixed, please reopen this issue. And perhaps you could try using multiple cores again? It might be that joblib stopped due to the AssertionError but did not propagate the error message properly.

PS: --store-matrix stores the data as a gzipped file, since writing that data to disk often takes a lot longer than the algorithm itself.

PPS: epic seems really, really slow on your data. When I run epic it takes at most 5-10 minutes, even with many chip/control files. If your files are not huge, you might want to try unzipping them first; I don't know if that matters. If they are huge, you might want to do some QC/preprocessing.

Added you to the contributors list for helping me, btw.

@balwierz

Nope, it didn't help:

Traceback (most recent call last):
  File "/usr/local/bin/epic", line 165, in <module>
    run_epic(args)
  File "/usr/local/lib/python2.7/dist-packages/epic/run/run_epic.py", line 42, in run_epic
    args.number_cores)
  File "/usr/local/lib/python2.7/dist-packages/epic/utils/helper_functions.py", line 51, in merge_chip_and_input
    for chip_df, input_df in zip(chip_dfs, input_dfs))
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 754, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 604, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 567, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python2.7/dist-packages/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/usr/local/lib/python2.7/dist-packages/joblib/_parallel_backends.py", line 322, in __init__
    self.results = batch()
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 127, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/lib/python2.7/dist-packages/epic/utils/helper_functions.py", line 38, in _merge_chip_and_input
    assert len(merged_df) == chip_df_nb_bins, assertion_message
AssertionError: Wrong number of rows after merging ChIP/Input.
ChIP bins: 714215
Input bins: 1121027
Head of ChIP df: 
 Chromosome Bin Count
0 chr3 400 1
1 chr3 800 3
2 chr3 600 1
3 chr3 800 5
4 chr3 1000 7

Head of Input df: 
 Chromosome Bin Count
0 chr3 0 4
1 chr3 600 4
2 chr3 800 10
3 chr3 1000 1
4 chr3 800 3

Tail of ChIP df: 
 Chromosome Bin Count
714210 chr3 63268800 23
714211 chr3 63268600 1
714212 chr3 63268800 27
714213 chr3 63268600 2
714214 chr3 63268800 98

Tail of Input df: 
 Chromosome Bin Count
1121022 chr3 63268800 20
1121023 chr3 63268600 1
1121024 chr3 63268800 4
1121025 chr3 63268600 3
1121026 chr3 63268800 26

Number of bins in merged df: 
3355009
Head of merged df: 
 Chromosome Bin ChIP Input
0 chr3 400 1 0.0
1 chr3 800 3 10.0
2 chr3 800 3 3.0
3 chr3 600 1 4.0
4 chr3 800 5 10.0

Tail of merged df: 
 Chromosome Bin ChIP Input
3355004 chr3 63268800 98 24.0
3355005 chr3 63268800 98 16.0
3355006 chr3 63268800 98 20.0
3355007 chr3 63268800 98 4.0
3355008 chr3 63268800 98 26.0

@endrebak

Okay, but I know another fix that is almost certain to work. Stay tuned.

endrebak reopened this Aug 11, 2016
endrebak commented Aug 11, 2016

Hi --

In 0.1.7 I have implemented one new fix, plus a backup fix in case the first one does not work.

If the backup fix is used you will see the line "Making duplicated bins unique by summing them." Please tell me if you see it :)

Btw, the reason the bug happened (in addition to my code) might be that you have mates a very long distance apart. This might be bad, since epic uses the fragment midpoint, which could fall in an uninteresting region. See this question for my thinking: ChIP-Seq: When reducing paired end reads to one coordinate, should I use the midpoint?
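
For reference, here is the midpoint reduction in toy form (not epic's actual implementation; assuming the default 200 bp window size):

def midpoint_bin(start, end, bin_size=200):
    # Reduce a paired-end fragment to its midpoint, then snap the
    # midpoint down to the start coordinate of its bin.
    midpoint = (start + end) // 2
    return (midpoint // bin_size) * bin_size

# Mates far apart yield a midpoint in a region neither mate may cover:
print(midpoint_bin(1000, 51000))  # 26000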

@endrebak

If the backup fix is used, you will get an ImportError. I will fix this today.

endrebak commented Aug 11, 2016

0.1.8 is out and I am rather confident it will work. Looking forward to hearing how it goes. If it does not work, I would appreciate it if you could share your chr3 files with me :)

@endrebak

Somebody kindly lent me a pair of paired-end ChIP-seq files, so I'll see if it works now.

endrebak commented Aug 11, 2016

I was able to reproduce the error, but epic now works with one core. I'll open a new issue for the multiprocessing bug. See #33.
