bug in code #3

Open
jzohren opened this issue Aug 10, 2023 · 0 comments

Hi Christopher,

Thanks for sharing your script to parse RM output. I just used it as part of the FastTE pipeline - very useful!

I went through the R script more or less line by line today because I wanted to change the output format slightly, and I noticed a bug. I traced it back to the way the RM output is read in on line 26. If a line does not start with whitespace before the first column (sw_score), all of the columns of that line are shifted to the left and the resulting data frame is incorrect. I found a fix for this issue, see below.

Lines in raw RM output:

27245    0.0  0.0  1.7  Chr1      20619926 20623101  (9804570) + TE_00000209     __ClassII_DNA_CACTA_nMITE        2683   5805  (1708)  7595
  625    2.4  0.0  2.4  Chr1      20623765 20623848  (9803823) + TE_00000209     __ClassII_DNA_CACTA_nMITE        5806   5887  (1626)  7596
 1409    0.6  0.0  0.0  Chr1      20624203 20624367  (9803304) + TE_00000209     __ClassII_DNA_CACTA_nMITE        5878   6042  (1471)  7597

Original representation in the R data frame (first row should start with 27245, but all columns are shifted to the left):

sw_score	perc_div	perc_del	perc_insert	qry_id	qry_start	qry_end	qry_left	matching_repeat	repeat_id	matching_class	no_bp_in_complement	in_repeat_start	in_repeat_end
0	0	1.7	Chr1	20619926	20623101	-9804570	+	TE_00000209	__ClassII_DNA_CACTA_nMITE	2683	5805	-1708	7595
625	2.4	0	2.4	Chr1	20623765	20623848	-9803823	+	TE_00000209	__ClassII_DNA_CACTA_nMITE	5806	5887	-1626
1409	0.6	0	0	Chr1	20624203	20624367	-9803304	+	TE_00000209	__ClassII_DNA_CACTA_nMITE	5878	6042	-1471

R data frame after fixing:

sw_score	perc_div	perc_del	perc_insert	qry_id	qry_start	qry_end	qry_left	matching_repeat	repeat_id	matching_class	no_bp_in_complement	in_repeat_start	in_repeat_end
27245	0	0	1.7	Chr1	20619926	20623101	-9804570	+	TE_00000209	__ClassII_DNA_CACTA_nMITE	2683	5805	-1708
625	2.4	0	2.4	Chr1	20623765	20623848	-9803823	+	TE_00000209	__ClassII_DNA_CACTA_nMITE	5806	5887	-1626
1409	0.6	0	0	Chr1	20624203	20624367	-9803304	+	TE_00000209	__ClassII_DNA_CACTA_nMITE	5878	6042	-1471

This is how I fixed it (replace line 26 with these two lines of code):

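# Trim leading/trailing whitespace so every line starts at the first field
str.res <- unlist(stringr::str_trim(x))
# Split the trimmed line into fields on runs of whitespace
str.res <- unlist(stringr::str_split(str.res, "\\s+"))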
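
For illustration, here is a minimal base-R sketch of the same trim-then-split idea, run on two of the raw lines shown above. It is not the script's actual code: `parse_rm_line` is a hypothetical helper name, and `trimws`/`strsplit` are used as base-R stand-ins for `stringr::str_trim`/`stringr::str_split` so the snippet runs without extra packages. It also shows that splitting a line with leading spaces without trimming first produces an extra empty first field, so lines with and without leading whitespace end up with different field positions.

```r
# Hypothetical helper: trim a raw RepeatMasker line, then split it into fields.
parse_rm_line <- function(x) {
  trimmed <- trimws(x)                # drop leading/trailing whitespace
  unlist(strsplit(trimmed, "\\s+"))   # split on runs of whitespace
}

# Two raw lines from the example above: one without and one with leading spaces.
rm_lines <- c(
  "27245    0.0  0.0  1.7  Chr1      20619926 20623101  (9804570) + TE_00000209     __ClassII_DNA_CACTA_nMITE        2683   5805  (1708)  7595",
  "  625    2.4  0.0  2.4  Chr1      20623765 20623848  (9803823) + TE_00000209     __ClassII_DNA_CACTA_nMITE        5806   5887  (1626)  7596"
)

# After trimming, both lines split into the same number of fields,
# and the first field is always the score.
fields <- lapply(rm_lines, parse_rm_line)

# For comparison: splitting the second line without trimming yields
# an empty string as its first element.
untrimmed <- unlist(strsplit(rm_lines[2], "\\s+"))

print(sapply(fields, `[`, 1))
print(untrimmed[1])
```

With trimming in place, the first element of every split line is the sw_score value, regardless of how the line is padded.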

Hope this is useful to others as well.

Best wishes,
Jasmin
