Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault loading input SNP file with unexpected format #6

Open
yancylo opened this issue Feb 23, 2022 · 11 comments
Open

Segfault loading input SNP file with unexpected format #6

yancylo opened this issue Feb 23, 2022 · 11 comments

Comments

@yancylo
Copy link

yancylo commented Feb 23, 2022

I am trying to run RELI on a list of 3,635 genome-wide significant variants which are aggregated into 78 LD blocks, but after RELI loaded in the snp table, it failed when loading the LD table. The error I got is "Segmentation fault":

|---------------------------------------------------------------|
|                                                               |
|       Regulatory Element Locus Intersection (RELI) Analysis   |
|                       Current version: 0.90                   |
|                                                               |
|---------------------------------------------------------------|
Start Regulatory Element Locus Intersection (RELI) analysis.
Running arguements:
1) phenotype snp file: multiple_sclerosis.snp
2) phenotype LD structure file: multiple_sclerosis.ld
3) SNP matching mode: 0
4) null model file: ../data/Null/CommonSNP_MAFmatch
5) dbSNP table file: ../data/SNPtable/SNPtable
6) target chip-seq label: hg19_0302
7) chip-seq index file: ../data/ChIPseq.index
8) chip-seq data dir: ../data/ChIP-seq/
9) output dir name: Output/
10) genome build file:
11) statistics output file name: Output//hg19_0302.RELI.stats
12) overlapped locus numbers output file name: Output//hg19_0302.RELI.overlaps
13) overlapped snps output file name: Output//hg19_0302.RELI.rsids
14) provided phenotype name: MS
15) provided ancestry name: .
using default hg19 genome build
genome structure loaded.
chip-seq index file loaded.
target ChIP-seq file set.
target ChIP-seq file loaded, sorted, and width calculated.
null model loaded.
reading snp file completed.
snp table loaded.
snp MAF information queried.
phenotype LD file loaded.
Segmentation fault

I followed the format of SLE_EU.ld in the example/ folder but I am not sure if I am missing something. My .ld file contains 78 rows, each row starts with the top hit variant of the LD block, and is followed by ":" and a list of rs IDs in the same LD block with genome-wide significant pvalue. All rsIDs in the .ld file are listed in the .snp file (BED4 format).

I tried to run RELI without the .ld file (only run with .snp). It finished without any errors, but of course the results were not legitimate because LD was ignored - I got a Ratio of 1 (100% intersect) and p-value of 0.

Please advise on how to properly account for LD in the analysis. Thank you very much!

@ernstki
Copy link
Contributor

ernstki commented Feb 24, 2022

Hi, @yancylo. If it's possible to share it, please email your .ld file to tftoolsadmin -at- cchmc.org so I can have a look at it.

If you can't share it (I am not a scientist, but I assume there might be reasons you can't), please run file your_ld_file.ld and report back if it says something like

$ file your_ld_file.ld
your_ld_file.ld: ASCII text, with CRLF line terminators

What I'm looking for specifically is DOS/Windows line endings. It's possible that RELI needs to be updated to be more robust when reading files in this format. The workaround is to simply convert to Unix (linefeed-terminated) format with dos2unix and try again.

Other ways you can troubleshoot:

# macOS
cat -vte your_ld_file.ld | head

# Linux
cat -A your_ld_file.ld | head

If any ^M (ASCII carriage return) characters appear in that output, then run dos2unix your_ld_file.ld and try again.

macOS doesn't have a dos2unix command built-in, but you can use this as an alternative:

tr -d \\r < your_ld_file.ld > new_ld_file.ld

Please report back if this resolves / doesn't resolve your segfault error.

@yancylo
Copy link
Author

yancylo commented Feb 24, 2022

Hi @ernstki thank you for your help. I tried dos2unix but it did not resolve the segfault. There are no DOS/Windows line endings.

file meta_analysis_r1_multiple_sclerosis_nohla.ld
meta_analysis_r1_multiple_sclerosis_nohla.ld: ASCII text, with very long lines

Are there any other specifications of the LD input file that I am overlooking? e.g. Do the lists of rsIDs need to be sorted/ organized in specific ways?

Thanks

@ernstki
Copy link
Contributor

ernstki commented Feb 24, 2022

@yancylo Can you please update the issue description with the exact command line you used (sanitizing any filenames if you need to)?

The message phenotype LD file loaded. appears after the LD file is already loaded, so I may have led you on a wild goose chase here. Sorry 'bout that.

My advice about using file * in the directory with all your input files still stands, though.

@yancylo
Copy link
Author

yancylo commented Feb 25, 2022

Hi @ernstki thanks for checking, here is the command I used

../RELI \
-snp multiple_sclerosis.snp \
-ld multiple_sclerosis.ld \
-index ../data/ChIPseq.index \
-data ../data/ChIP-seq/ \
-target hg19_0302 \
-null ../data/Null/CommonSNP_MAFmatch \
-dbsnp ../data/SNPtable/SNPtable \
-phenotype MS_meta \
-out Output/

@ernstki
Copy link
Contributor

ernstki commented Feb 25, 2022

@yancylo Can you confirm with file or cat -A / cat -vte as described above that the .snp file is also free from DOS/Windows line endings?

If you are in a completely Unix/macOS environment, and not analyzing files from collaborators, or files that made a pass through Microsoft Excel, there's usually little need. But it's always a first troubleshooting step when I find myself in situations like this.

It would also help if you could cat -A multiple_sclerosis.ld | head or cat -vte multiple_sclerosis.ld | head and confirm that the delimiter is indeed the tab character, and that no lines in the LD input file end with a tab character. In the output of cat -A / cat -vte (on macOS), tab characters show up as ^I (circumflex, capital eye). A Unix newline (ASCII linefeed) will show up as $.

@ernstki
Copy link
Contributor

ernstki commented Feb 25, 2022

Emailing or sharing the input files—if you are able—via Dropbox / Google Drive / OneDrive / whatever-you-use with tftoolsadmin -at- cchmc.org (that's me) would still be a tremendous help here. It completely takes out the guesswork.

@yancylo
Copy link
Author

yancylo commented Feb 25, 2022

@ernstki Sorry I will need to get my manager's approval before sharing the data files with you, I will get back to you on that.

Re: I did cat -A on both .snp and .ld files, here are the top lines:

$ cat -A multiple_sclerosis.ld | head -1
rs10797431:^Irs3762438^Irs4648482^Irs144142039^Irs4648663^Irs11810740^Irs12731995^Irs6687680^Irs2377041^Irs12044475^Irs2296443^Irs12042279^Irs3748825^Irs2182176^Irs4449972^Irs4648647^Irs12756665^Irs10797432^Irs2227312^Irs7537581^Irs12725202^Irs72644701^Irs4648450^Irs72644684^Irs72646003^Irs56305550^Irs12138909^Irs72646016^Irs4648662^Irs6667605^Irs10797433^Irs10797434^Irs60389769^Irs6657596^Irs10797431^Irs11577783^Irs138680782^Irs10910094^Irs6672381^Irs2495366^Irs146090614^Irs10910090^Irs7538063^Irs2227313^Irs2281852^Irs7550231^Irs61054170^Irs881640^Irs10910092^Irs10752745^Irs4870^Irs1974044$

$ file multiple_sclerosis.ld
multiple_sclerosis.ld: ASCII text, with very long lines

$file multiple_sclerosis.snp
multiple_sclerosis.snp: ASCII text

$ cat -A multiple_sclerosis.snp | head -5
chr1^I2486868^IC^IT^Irs3762438$
chr1^I2749921^IT^IC^Irs4648482$
chr1^I2696582^IC^IT^Irs144142039$
chr1^I2573727^IA^IC^Irs4648663$
chr1^I2711631^IG^IA^Irs11810740$

Do the .snp and .ld files need to be sorted in a specific order for RELI to work?

@ernstki
Copy link
Contributor

ernstki commented Feb 26, 2022

That's understandable. What you provided should be enough, though.

I'm not aware of any requirement that the input files be sorted, but I believe the .snp file needs to be a 4-column BED format coordinates file. I'm not sure why we chose .snp as the extension for that instead of (the more obvious extension) .bed.

This is just solely based on an input file that I know works with RELI, which is example/SLE_EU.snp.

Either way, you have found a bug in RELI (thanks for bringing it to our attention!), and it will still need to be updated to be more robust against malformed input.

@ernstki
Copy link
Contributor

ernstki commented Feb 26, 2022

Here is how I converted your SNPs to the expected format:

awk -v OFS='\t' '{print $1,$2-1,$2,$5}' multiple_sclerosis.snp > multiple_sclerosis.bed

Then give multiple_sclerosis.bed to the -snp option of RELI.

The last SNP (rs118107401) appears to have incorrect coordinates to me, but I'm not a bioinformatician, so you may know the reason for that.

See also this script for fetching coordinates in the appropriate (BED4) format based on rs ID, if that's what you're starting from. (Note that it defaults to hg19 and snp142, which you can easily change.)

Using your multiple_sclerosis.snp as an example:

cut -f5 multiple_sclerosis.snp | getsnpcoords > multiple_sclerosis.bed

@yancylo
Copy link
Author

yancylo commented Feb 28, 2022

Thank you very much for your help! I converted my input list of SNPs to the standard .bed format and it worked.
The last SNP is rs11810740 (not 401) which is chr1:2711631:G:A in GRCh37 or hg19.
Anyway, thank you very much for helping me identify the issue!

@yancylo yancylo closed this as completed Feb 28, 2022
@ernstki
Copy link
Contributor

ernstki commented Feb 28, 2022

Oh! Poor attention to detail on my part. Who knows where I came up with that extra "1"; thanks for sorting me out on that.

Also, I'm glad we were able to work through it and get you rolling.

Let's leave this issue open until I can get the code updated to not crash on SNP input like yours. GitHub will generate a few more automatic emails as I'm referencing it from commits and PRs, but you can safely ignore those.

I don't have expansive knowledge of all the bioinformatics formats out there, so if you don't mind sharing, what analysis tool, web site, or database is generating the SNP list in that format? If it's a very common tool/database, we might want to look at just accepting SNP lists in that format directly.

@ernstki ernstki reopened this Feb 28, 2022
@ernstki ernstki changed the title Segmentation fault when loading LD file Segfault loading input SNP file with unexpected format Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants