LTR Harvest "Longest not defined error" #901

Open
joan103777 opened this issue Sep 17, 2018 · 3 comments

@joan103777

Problem description

When I submit an LTR Harvest job it crashes with the error "Longest not defined". This is the error that comes up in the Genome Tools gt script:
longest is already defined as %lu
program error: not enough space for specpos
program error: too much space for specpos: allocated = %lu != %lu = used
longest is not defined after merging
gt_sufbwt2fmindex
copytheindexfile
nextesamergedsufbwttabvalues
%lu: sum_score=%ld


## Exact command line call triggering the problem
gt ltrharvest

## Example minimal input triggering the problem
gt ltrharvest -index pe.cor.suffix_index.fsa -range 0 0 -seed 30 -xdrop 5 -mat 2 -mis -2 -ins -3 -del -3 -minlenltr 100 -maxlenltr 1000 -mindistltr 1000 -maxdistltr 15000 -similar 90 -out predicted.pe.cor.fsa

## What GenomeTools version are you reporting an issue for (as output by `gt -version`)?
GenomeTools 1.5.9

## Did you compile GenomeTools from source? If so, please state the `make` parameters used.
make with no parameters

## What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?
Mac OS with 8 GB of memory, but the program is being run on a server with more memory.
@satta
Member

satta commented Sep 21, 2018

Thanks for reporting this! Unfortunately I was unable to reproduce your issue with a test sequence of my choice. To allow for further debugging, could you perhaps

  • provide the input sequence you used or, if confidential, a substring of it that triggers the problem,
  • provide the full set of commands you used to prepare the input, including the gt suffixerator step?
  • try to recreate all indices by deleting all files starting with your sequence file name (pe.cor.suffix_index.fsa) and redo all steps in a clean index directory? What kind of storage are you creating and keeping your indices on (e.g. local temp directory or network mount)? How much free space is available there?
  • post the .prj file of your created index, e.g. pe.cor.suffix_index.fsa.prj?

@satta satta self-assigned this Sep 23, 2018
@joan103777
Author

Hi, I think this could be a memory error. The file size is 145,382,055,251.
For the suffixerator step I'm running this command:
gt suffixerator -db pe.cor.gz -indexname pe.cor.gz -tis -suf -lcp -des -ssp -sds -dna
When I run this, the suffix table is created, but the .prj file is not written before the job hits the server time limit (~48 hours). I tried to split the job by rerunning the command with only the -des -ssp -sds -dna options, and this created a .prj file.
I then tried to run LTRharvest and came across the "longest not defined" error. This error crashes the program before it can start. I also did all these steps in a clean directory, and that didn't help.
As for where the files are stored, they're currently in a local temp directory. I'm not sure how much free space is available there, but I think it's enough, as I haven't run into a space-related issue. Attached is the .prj file:
pe.cor.fsa.prj.zip

@satta
Member

satta commented Jan 8, 2019

Indeed this .prj file is missing the longest keyword, leading to the error.
Could it be that the directory still holds remains from your first run, together with a .prj file from your second run, which did not generate the enhanced suffix array (built via -suf -lcp, which you will need to run LTRharvest anyway)? That would let LTRharvest start without complaining about the missing suffix array and LCP table, but then fail on the mismatching project file, which lacks required entries.
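
A quick way to check a project file for this before starting LTRharvest could look like the sketch below. The function name `has_longest` is made up for illustration, and the assumption that the keyword starts a line in the .prj file is mine, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given .prj file mentions the
# "longest" keyword that LTRharvest expects.
has_longest() {
    # Assumption: the keyword sits at the start of a line in the .prj file.
    grep -q '^longest' "$1"
}

# Default file name taken from the index name used in this issue.
prj="${1:-pe.cor.suffix_index.fsa.prj}"
if [ -f "$prj" ]; then
    if has_longest "$prj"; then
        echo "$prj: longest keyword present"
    else
        echo "$prj: longest keyword missing; rebuild the index with -suf -lcp"
    fi
fi
```

If the keyword is missing, the project file most likely comes from a run without the suffix array and LCP table, and the index should be rebuilt rather than patched.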

Could you retry the indexing step with all options in a clean directory, on a machine that has enough power and/or memory to finish the indexing step? You can split the input FASTA file into several parts and process them individually. If you don't have enough memory but can somehow solve the time limit problem, then you can use the -parts option to gt suffixerator to process parts of the input sequence sequentially. For example, -parts 10 would divide the input sequence into 10 parts and process these one at a time. See gt suffixerator -helpdev for more info.
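
Put together, a single-pass rebuild followed by the LTRharvest call might be sketched as follows. Every flag comes from the commands already posted in this thread; the index name `pe.cor.idx`, the output name `predicted.fsa`, and the `DRY_RUN` switch are placeholders of mine. With `DRY_RUN=1` (the default here) the script only prints the commands so they can be reviewed before a long job is submitted.

```shell
#!/bin/sh
# Sketch only: file names and the DRY_RUN switch are illustrative placeholders.
DB="${DB:-pe.cor.gz}"          # input sequence file, as in the thread
INDEX="${INDEX:-pe.cor.idx}"   # hypothetical index name for a clean directory
PARTS="${PARTS:-10}"           # number of parts for -parts
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands, anything else = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build all tables in one run so the .prj file matches the index files.
reindex() {
    run gt suffixerator -db "$DB" -indexname "$INDEX" \
        -tis -suf -lcp -des -ssp -sds -dna -parts "$PARTS"
}

# Run LTRharvest on the freshly built index.
harvest() {
    run gt ltrharvest -index "$INDEX" -out predicted.fsa
}

reindex && harvest
```

Using the same index name for both steps avoids mixing files from different runs, which is the situation suspected above.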
