Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option -r 1 in cdhit-est does not work #13

Open
SamaZYX opened this issue Oct 5, 2022 · 2 comments
Open

Option -r 1 in cdhit-est does not work #13

SamaZYX opened this issue Oct 5, 2022 · 2 comments

Comments

@SamaZYX
Copy link

SamaZYX commented Oct 5, 2022

I am trying to us the cdhit command to cluster nucleotide sequences in both strands but it does not seem to work.

The option "-r 1" (default parameter) tells cdhit to search in both strands but no matter what value I use, it is not reading it.

Could this be related to the missing part in cdhit-est.cpp from the original code?

if ( options.option_r ) {
	Comp_AAN_idx.resize( seq_db.NAAN );
	make_comp_short_word_index(options.NAA, NAAN_array, Comp_AAN_idx);
}
@amcdavid
Copy link
Owner

amcdavid commented Oct 6, 2022

Hi @SamaZYX ,

I haven't had the need to cluster unstranded data. Actually, it looks like I over-rode the r=1 default because my use case was to cluster Vh segments, and treating these data as unstranded would return invalid results,

options = c(options, list(ap = 1, r = 0))

Do you have a reproducible example? As a next step, I would try calling cdhit directly, via cdhitC e.g.

seq_cluster_index = cdhitC(options, name, showProgress) + 1
.

@SamaZYX
Copy link
Author

SamaZYX commented Oct 7, 2022

Hi @amcdavid ,

My case is not related to VH nor Single Cell experiments. I was just trying to find if someone that had already made an implementation of CD-HIT in R for general use.

Unfortunately, I am no expert on R yet, and by looking at your scripts I just found out that you can add C code in R with Rcpp.

I tested your suggestion of running cdhitestC and cdhit by creating a new functions with the modified -r 1 parameter (see below) but I still get the same result as if I used r=0.

library("CellaRepertorium")
library("Biostrings")
library("dplyr")

cdhitestC2 <- function(opts, name, showProgress) {
  .Call('_CellaRepertorium_cdhitestC', PACKAGE = 'CellaRepertorium', opts, name, showProgress)
}

cdhit2 = function(seqs, identity = NULL, kmerSize = NULL, min_length = 6, s = 1, G = 1,
                 only_index = FALSE, showProgress = interactive(), ...) {
  if(any(width(seqs) < min_length)) stop("Some sequences shorter than `min_length`;
                                           remove these or decrease min_length")
  name = 'CD-Hit'
  uopts = list(...)
  options = list()
  options$i <- tempfile()
  options$s = s
  writeXStringSet(seqs, options$i)
  on.exit(unlink(options$i))
  type = switch(
    class(seqs),
    AAStringSet = 'cdhitC',
    DNAStringSet = 'cdhitestC',
    stop('seqs must be either AAStringSet or DNAStringSet')
  )
  if(type == 'cdhitestC'){ #DNA
    kmerSize = case_when(identity < .8 ~ 4, identity < .85 ~ 5,
                         identity < .88 ~ 6, identity < .9 ~ 7,
                         identity < .95 ~ 9, identity < 1 ~ 10, TRUE ~ 11)
    options = c(options, list(ap = 1, r = 1))
  } else{
    kmerSize = 5
  }
  options$n = kmerSize
  options$G = G
  options$c = identity
  options$l = min_length - 1
  options = c(uopts, options)
  options = options[!duplicated(names(options))]
  options = lapply(options, as.character)
  if(type == 'cdhitC'){
    seq_cluster_index = CellaRepertorium::cdhitC(options, name, showProgress) + 1
  } else{
    seq_cluster_index = cdhitestC2(options, name, showProgress) + 1
  }
  if(only_index) return(seq_cluster_index)
  tibble::tibble(query_name = names(seqs), seq = as.character(seqs),
                 cluster_idx = seq_cluster_index) %>%
    dplyr::group_by(cluster_idx) %>%
    dplyr::mutate(n_cluster = dplyr::n()) %>% ungroup()
}

tbl = cdhit2(nnseq, identity = 1, only_index = FALSE, g=1, l=10, r=1)

I tested the -r 0 and -r 1 in the original program on unix and in there it seems to work.

I can see that cdhitestC is calling the compiled C version of your cdhitest (cdhit-est.cpp) and I thought that the code in my first post might have been related to the issue.

I know that running cdhitest in both strands is not related to your line of work, however, it would be very nice to be able to use cdhit in your package in other lines of work where clustering of unstranded data is needed.

Thanks for your reply.
Jose

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants