-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option -r 1 in cdhit-est does not work #13
Comments
Hi @SamaZYX , I haven't had the need to cluster unstranded data. Actually, it looks like I over-rode the CellaRepertorium/R/cdhit-methods.R Line 77 in 27c07ab
Do you have a reproducible example? As a next step, I would try calling cdhit directly, via CellaRepertorium/R/cdhit-methods.R Line 89 in 27c07ab
|
Hi @amcdavid , My case is not related to VH nor Single Cell experiments. I was just trying to find if someone that had already made an implementation of CD-HIT in R for general use. Unfortunately, I am no expert on R yet, and by looking at your scripts I just found out that you can add C code in R with Rcpp. I tested your suggestion of running library("CellaRepertorium")
library("Biostrings")
library("dplyr")
cdhitestC2 <- function(opts, name, showProgress) {
.Call('_CellaRepertorium_cdhitestC', PACKAGE = 'CellaRepertorium', opts, name, showProgress)
}
cdhit2 = function(seqs, identity = NULL, kmerSize = NULL, min_length = 6, s = 1, G = 1,
only_index = FALSE, showProgress = interactive(), ...) {
if(any(width(seqs) < min_length)) stop("Some sequences shorter than `min_length`;
remove these or decrease min_length")
name = 'CD-Hit'
uopts = list(...)
options = list()
options$i <- tempfile()
options$s = s
writeXStringSet(seqs, options$i)
on.exit(unlink(options$i))
type = switch(
class(seqs),
AAStringSet = 'cdhitC',
DNAStringSet = 'cdhitestC',
stop('seqs must be either AAStringSet or DNAStringSet')
)
if(type == 'cdhitestC'){ #DNA
kmerSize = case_when(identity < .8 ~ 4, identity < .85 ~ 5,
identity < .88 ~ 6, identity < .9 ~ 7,
identity < .95 ~ 9, identity < 1 ~ 10, TRUE ~ 11)
options = c(options, list(ap = 1, r = 1))
} else{
kmerSize = 5
}
options$n = kmerSize
options$G = G
options$c = identity
options$l = min_length - 1
options = c(uopts, options)
options = options[!duplicated(names(options))]
options = lapply(options, as.character)
if(type == 'cdhitC'){
seq_cluster_index = CellaRepertorium::cdhitC(options, name, showProgress) + 1
} else{
seq_cluster_index = cdhitestC2(options, name, showProgress) + 1
}
if(only_index) return(seq_cluster_index)
tibble::tibble(query_name = names(seqs), seq = as.character(seqs),
cluster_idx = seq_cluster_index) %>%
dplyr::group_by(cluster_idx) %>%
dplyr::mutate(n_cluster = dplyr::n()) %>% ungroup()
}
tbl = cdhit2(nnseq, identity = 1, only_index = FALSE, g=1, l=10, r=1) I tested the I can see that I know that running cdhitest in both strands is not related to your line of work, however, it would be very nice to be able to use cdhit in your package in other lines of work where clustering of unstranded data is needed. Thanks for your reply. |
I am trying to us the cdhit command to cluster nucleotide sequences in both strands but it does not seem to work.
The option "-r 1" (default parameter) tells cdhit to search in both strands but no matter what value I use, it is not reading it.
Could this be related to the missing part in cdhit-est.cpp from the original code?
The text was updated successfully, but these errors were encountered: