
Not sure how I should deal with the predicted repeat results versus the results from the existing databases #20

Closed
ld9866 opened this issue Nov 23, 2021 · 5 comments
Labels
question (This issue is a support question), upstream (Feature or bug is in an upstream package such as RepeatMasker or another included program)

Comments

@ld9866

ld9866 commented Nov 23, 2021

Hello! Thank you very much for developing such good software. I have a few questions I would like to ask you.
We assembled a genome and predicted its repeats with RepeatModeler. When we ran RepeatMasker with the predicted repeat library (***-families.fa), we found that repeats made up as much as 38.75% of the genome, which may be right. However, when we instead used the repeat libraries of existing species from the Dfam and RepBase databases, the repeat content reached 41.31%. We also tried merging the de novo predictions with the database libraries, and the repeat content was as high as 44.71%. This was much higher than we expected and may be wrong.
I am not sure which result we should use for the downstream genome annotation analysis, so I would like to ask you.
Looking forward to hearing from you!

@jebrosen
Member

Hi,

That is a good question, but this does not necessarily sound like a problem. RepeatModeler processes samples of a particular genome assembly, so it may miss some specific TEs due to "bad luck" or due to limitations in the sequencing or assembly process that make them difficult to recognize. Dfam and RepBase include TE families from ancestral species that are known from prior research but may be too fragmented or mutated in your assembly to meet RepeatModeler's thresholds. Because the two approaches have different limitations, it can be more informative to combine the newly discovered elements with the known ones into one library, as you did.

Does this seem likely to explain the differences in your results?
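
For reference, here is a minimal sketch of the combined-library workflow (file names such as mygenome-families.fa, known_repeats.fa, and genome.fa are placeholders for your RepeatModeler output, the sequences exported from Dfam/RepBase, and your assembly):

```bash
# Concatenate the de novo RepeatModeler library with the known
# Dfam/RepBase sequences into one custom library.
cat mygenome-families.fa known_repeats.fa > combined_library.fa

# Mask the assembly in a single pass with the combined library.
# -lib supplies the custom library, -pa sets the number of parallel
# search jobs, -gff also writes a GFF annotation, and -dir sets the
# output directory.
RepeatMasker -lib combined_library.fa -pa 8 -gff -dir rm_combined genome.fa
```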

@ld9866
Author

ld9866 commented Dec 1, 2021

Hello!
Very happy to receive your reply!
You suggested using the combined library as the input file. We tried it, and the results were very confusing to me.
When we used the combined (cat) library of the RepeatModeler predictions plus Dfam and RepBase, the result was 44.71%, as I said before.
If we first mask the repetitive sequences with the RepeatModeler predictions and then run the masked sequence again against the Dfam and RepBase library, the result is 43.64%. If we first use the Dfam and RepBase library and then run the RepeatModeler predictions, the result is 43.91%.
I would like to ask which method we should use.
The combining step is a very confusing place for us. What causes such a deviation?
Looking forward to hearing from you!

@jebrosen
Member

jebrosen commented Dec 1, 2021

> The combining step is a very confusing place for us. What causes such a deviation?

Running RepeatMasker twice with two different libraries will likely produce different results from running it once with a combined library. For example, the two repeat libraries could include similar but not identical families or fragments of families. In this situation, the first RepeatMasker run might mask most of an element. Then, the second RepeatMasker run might not recognize the leftover part because it is too short. So, each RepeatMasker run can affect the other run depending on the order. If RepeatMasker starts with a combined library instead, it can more effectively discover the elements from both libraries at once.
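
To make the comparison concrete, here is a sketch of the two workflows (paths and library names are placeholders; when the input is genome.fa, RepeatMasker writes the masked sequence as genome.fa.masked in the chosen output directory):

```bash
# Sequential approach: mask with the de novo library first, then run the
# known-repeat library on the already-masked output of the first pass.
RepeatMasker -lib mygenome-families.fa -dir pass1 genome.fa
RepeatMasker -lib known_repeats.fa -dir pass2 pass1/genome.fa.masked

# Combined approach: a single run against one merged library.
cat mygenome-families.fa known_repeats.fa > combined_library.fa
RepeatMasker -lib combined_library.fa -dir rm_combined genome.fa
```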

> I would like to ask which method we should use.

The most appropriate method will depend on your goal and how well the RepeatModeler libraries came out for your species. For example, one method might mask more sequence, while another could produce a cleaner annotation with fewer fragments or more well-known names.
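
One way to compare the runs (file names depend on your inputs; these are just the defaults from the sketch above) is to look at the summary that RepeatMasker writes next to each masked output:

```bash
# Each RepeatMasker run produces a *.tbl summary; the "bases masked"
# line reports the overall masked fraction for that run.
grep "bases masked" pass2/genome.fa.masked.tbl rm_combined/genome.fa.tbl
```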

I hope this explanation helps you to decide what method to use!

@ld9866
Author

ld9866 commented Dec 2, 2021

Thank you very much for your valuable advice and patient help.
Your help is really of great significance to our work.
We have decided to use the "combined" library method you described.
Maybe this is the more appropriate method.
Sincerely yours

@jebrosen added the question and upstream labels on Dec 2, 2021
@jebrosen
Member

jebrosen commented Dec 3, 2021

Glad to hear! It seems like this question has been answered, but please re-open this or a new issue if you have more.
