"Exception: Not valid subsetter: 1" while using epic2-df #32

Closed
wesleylcai opened this issue Sep 6, 2019 · 16 comments

@wesleylcai

I'm trying to analyze knockout and wildtype samples (including input for each) using epic2-df. However I get the following error: "Exception: Not valid subsetter: 1"

Here's the full output:
epic2-output.txt

Here are examples (head -n100 of the input files):
TKO: Sample_2D_KDM2A_me3.mqsd.head100.bedpe.txt
CKO: Sample_2D_KDM2A_input.mqsd.head100.bedpe.txt
TWT: Sample_2D_Arab2_me3.mqsd.head100.bedpe.txt
CWT: Sample_2D_Arab2_input.mqsd.head100.bedpe.txt

Here's my command:
epic2-df \
  --treatment-knockout Sample_2D_KDM2A_me3.mqsd.bedpe \
  --control-knockout Sample_2D_KDM2A_input.mqsd.bedpe \
  --treatment-wildtype Sample_2D_Arab2_me3.mqsd.bedpe \
  --control-wildtype Sample_2D_Arab2_input.mqsd.bedpe \
  --genome hg19 \
  --false-discovery-rate-cutoff 0.01 \
  --false-discovery-rate-comparison 0.01 \
  --bin-size 200 \
  --gaps-allowed 3 \
  --fragment-size 200 \
  --chromsizes hg19.chrom.sizes \
  --output-knockout Sample_2D_KDM2A_me3.mqsd \
  --output-wildtype Sample_2D_Arab2_me3.mqsd

Interestingly, the same command worked with another set of bedpe files, so it may be an incompatibility in some of my bedpe files? Any assistance would be appreciated!

@endrebak
Member

endrebak commented Sep 6, 2019

Is this reproducible with just the head? Will look at it on Monday :) Thanks for bothering to report :)

@wesleylcai
Author

I tried it with the head and also again with head -n100000

Looks like it works for those files... Hmm so maybe there are some wonky lines in the files? How do you think we can pin-point the problem?
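One way to look for wonky lines is to scan the whole file and report anything that does not look like BEDPE. The following is only a rough sketch: `find_suspect_lines` is a hypothetical helper (not part of epic2 or bedtools), and it assumes tab-separated lines whose first six fields are chrom1, start1, end1, chrom2, start2, end2.

```python
def find_suspect_lines(lines, min_fields=6):
    """Return (line_number, reason) pairs for lines that do not look like BEDPE."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < min_fields:
            suspects.append(
                (number, f"expected at least {min_fields} fields, got {len(fields)}")
            )
            continue
        # Fields 2, 3, 5 and 6 (1-based) should be integer start/end coordinates.
        for index in (1, 2, 4, 5):
            try:
                int(fields[index])
            except ValueError:
                suspects.append((number, f"non-integer coordinate {fields[index]!r}"))
                break
    return suspects


# Made-up example lines: one good, one truncated, one with a bad coordinate.
example = [
    "1\t100\t200\t1\t300\t400\tread1\t42\t+\t-\n",
    "1\t100\t200\t1\n",
    "X\t10\tabc\tX\t50\t90\tread3\t42\t+\t-\n",
]
for number, reason in find_suspect_lines(example):
    print(number, reason)
```

On a real file, an open file handle can be passed in place of the list, for example `find_suspect_lines(open("some.bedpe"))`. If nothing is reported, the lines themselves are probably fine and the problem is elsewhere.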

@endrebak
Member

endrebak commented Sep 6, 2019

The error seems to be in my pyranges library. The error message says that the chromosome is an int, but it should always be a string. Dunno why it happens, but I am trying to fix it :)

Can you check your version of pyranges with

$ python
>>> import pyranges as pr
>>> pr.__version__

@wesleylcai
Author

Ahaaaa. I think I might know why... I used bowtie2 to map my fastq and then converted them to bedpe using bedtools. The scaffold names are "1, 2, 3...X, Y, MT", instead of "chr1, chr2, chr3...chrX, chrY, chrM". Indeed I had to use a custom chrom.sizes file that lists the scaffolds as 1,2,3.

Do you think this could be the cause?

@endrebak
Member

endrebak commented Sep 6, 2019

The error is in epic2-df after it has successfully run epic on both KO and WT. So the error happens when it works on the result of those epic2 runs.

@endrebak
Member

endrebak commented Sep 6, 2019

> Do you think this could be the cause?

No, but I wondered why you used a custom genome sizes file for hg19. When I realized why you did it I added a warning message to epic2 when the chromosome size names and chromosome names in the read file are incompatible.
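For illustration, a check of this kind could look roughly like the sketch below. This is not the actual epic2 code; `unknown_chromosomes` is a hypothetical helper, and the sizes and read names are illustrative values only.

```python
def unknown_chromosomes(chromsizes, read_chromosomes):
    """Return chromosome names seen in the reads but absent from the sizes table."""
    known = {str(name) for name in chromsizes}
    return sorted({str(name) for name in read_chromosomes} - known)


# UCSC-style names in the sizes table versus Ensembl-style names in the reads.
sizes = {"chr1": 249250621, "chr2": 243199373, "chrX": 155270560}
reads = ["1", "1", "2", "X"]

missing = unknown_chromosomes(sizes, reads)
if missing:
    print("Warning: chromosomes in reads but not in chromsizes:", ", ".join(missing))
```

With a custom chrom.sizes file that lists the scaffolds as 1, 2, 3, ..., the same check would return an empty list and print nothing.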

@wesleylcai wesleylcai reopened this Sep 6, 2019
@endrebak
Member

endrebak commented Sep 6, 2019

That is okay, I am hoping the error is due to your pyranges being old :)

@wesleylcai
Author

wesleylcai commented Sep 6, 2019

Looks like it's version 0.0.53

$ python
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
[GCC 7.3.0] on linux
>>> import pyranges as pr
>>> pr.__version__
'0.0.53'

> The error is in epic2-df after it has successfully run epic on both KO and WT. So the error happens when it works on the result of those epic2 runs.

Indeed, the individual outputs work well and I get two files in the output folder. So I agree with your assessment.

@endrebak
Member

endrebak commented Sep 6, 2019 via email

@wesleylcai
Author

Yes, I can send you a google drive link. Which email should I use?

@endrebak
Member

endrebak commented Sep 6, 2019 via email

@wesleylcai
Author

I have sent you an invite via google drive! Thanks for your help.

@endrebak
Member

endrebak commented Sep 9, 2019

I have downloaded the files and am running the analysis now. I have some potential fixes that I will attempt tomorrow :)

@endrebak
Member

endrebak commented Sep 9, 2019

I was able to reproduce the error. Hooray! Will continue tomorrow. Thanks for sharing a reproducible example :)

@endrebak endrebak closed this as completed Sep 9, 2019
@endrebak endrebak reopened this Sep 9, 2019
@endrebak
Member

endrebak commented Sep 9, 2019

(Did not mean to close)

@endrebak
Member

(Notes to self)

The error seems to be due to the following:

When pandas reads a table, it guesses the type of each column. For our files it guesses that the chromosome column is of type int, since the first values are 1, 2, ..., but when it gets to X and Y it changes its mind and decides the type is object/str.

sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

So you end up with the following different chromosomes:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y']

So initially, an int is used for the chromosome lookup, which is what triggers "Not valid subsetter: 1".
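A small plain-Python illustration of why the mixed list is a problem, using the chromosome values from above. The `by_chromosome` dict is a made-up stand-in for any lookup keyed by string chromosome names; it is not the PyRanges data structure.

```python
chromosomes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
               "13", "14", "15", "16", "17", "18", "19", "20", "21", "22",
               "X", "Y"]

# An int and a str never compare equal, so 13 and "13" count as two chromosomes.
print(13 == "13")  # False

# Hypothetical lookup keyed by string names: the int key is simply not found.
by_chromosome = {"1": "regions on chromosome 1"}
print(1 in by_chromosome)       # False
print(str(1) in by_chromosome)  # True

# Converting every name to str collapses the duplicates while keeping the order.
normalized = list(dict.fromkeys(str(c) for c in chromosomes))
print(len(chromosomes), len(normalized))  # 25 24
```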

I have fixed this in epic2 now. I will also need to find a fix that works for PyRanges in general.
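For anyone reading tables like this with pandas directly, the DtypeWarning above already hints at one workaround: specify the dtype on import. The sketch below shows that approach on a tiny made-up table; it is not necessarily how the fix in epic2 was implemented.

```python
import io

import pandas as pd

# A tiny made-up table whose chromosome column happens to contain only numbers.
numeric_only = "1\t100\t200\n2\t150\t250\n"

# Left to itself, pandas guesses that column 0 holds integers.
guessed = pd.read_csv(io.StringIO(numeric_only), sep="\t", header=None)

# Forcing column 0 to str keeps chromosome names as strings in every case.
forced = pd.read_csv(
    io.StringIO(numeric_only), sep="\t", header=None, dtype={0: str}
)

print(guessed[0].tolist())  # [1, 2]
print(forced[0].tolist())   # ['1', '2']
```

The coordinate columns are untouched by `dtype={0: str}` and are still read as integers.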

Try pip install epic2==0.0.41. The fix will take a few hours to be out on bioconda.

Feel free to reopen if this did not fix it for you :)
