Option -r 1 in cdhit-est does not work #13

SamaZYX · 2022-10-05T13:46:58Z

I am trying to us the cdhit command to cluster nucleotide sequences in both strands but it does not seem to work.

The option "-r 1" (default parameter) tells cdhit to search in both strands but no matter what value I use, it is not reading it.

Could this be related to the missing part in cdhit-est.cpp from the original code?

if ( options.option_r ) {
	Comp_AAN_idx.resize( seq_db.NAAN );
	make_comp_short_word_index(options.NAA, NAAN_array, Comp_AAN_idx);
}

amcdavid · 2022-10-06T16:44:50Z

Hi @SamaZYX ,

I haven't had the need to cluster unstranded data. Actually, it looks like I over-rode the r=1 default because my use case was to cluster Vh segments, and treating these data as unstranded would return invalid results,

CellaRepertorium/R/cdhit-methods.R

Line 77 in 27c07ab

options = c(options, list(ap = 1, r = 0))

Do you have a reproducible example? As a next step, I would try calling cdhit directly, via cdhitC e.g.

CellaRepertorium/R/cdhit-methods.R

Line 89 in 27c07ab

seq_cluster_index = cdhitC(options, name, showProgress) + 1

.

SamaZYX · 2022-10-07T08:51:33Z

Hi @amcdavid ,

My case is not related to VH nor Single Cell experiments. I was just trying to find if someone that had already made an implementation of CD-HIT in R for general use.

Unfortunately, I am no expert on R yet, and by looking at your scripts I just found out that you can add C code in R with Rcpp.

I tested your suggestion of running cdhitestC and cdhit by creating a new functions with the modified -r 1 parameter (see below) but I still get the same result as if I used r=0.

library("CellaRepertorium")
library("Biostrings")
library("dplyr")

cdhitestC2 <- function(opts, name, showProgress) {
  .Call('_CellaRepertorium_cdhitestC', PACKAGE = 'CellaRepertorium', opts, name, showProgress)
}

cdhit2 = function(seqs, identity = NULL, kmerSize = NULL, min_length = 6, s = 1, G = 1,
                 only_index = FALSE, showProgress = interactive(), ...) {
  if(any(width(seqs) < min_length)) stop("Some sequences shorter than `min_length`;
                                           remove these or decrease min_length")
  name = 'CD-Hit'
  uopts = list(...)
  options = list()
  options$i <- tempfile()
  options$s = s
  writeXStringSet(seqs, options$i)
  on.exit(unlink(options$i))
  type = switch(
    class(seqs),
    AAStringSet = 'cdhitC',
    DNAStringSet = 'cdhitestC',
    stop('seqs must be either AAStringSet or DNAStringSet')
  )
  if(type == 'cdhitestC'){ #DNA
    kmerSize = case_when(identity < .8 ~ 4, identity < .85 ~ 5,
                         identity < .88 ~ 6, identity < .9 ~ 7,
                         identity < .95 ~ 9, identity < 1 ~ 10, TRUE ~ 11)
    options = c(options, list(ap = 1, r = 1))
  } else{
    kmerSize = 5
  }
  options$n = kmerSize
  options$G = G
  options$c = identity
  options$l = min_length - 1
  options = c(uopts, options)
  options = options[!duplicated(names(options))]
  options = lapply(options, as.character)
  if(type == 'cdhitC'){
    seq_cluster_index = CellaRepertorium::cdhitC(options, name, showProgress) + 1
  } else{
    seq_cluster_index = cdhitestC2(options, name, showProgress) + 1
  }
  if(only_index) return(seq_cluster_index)
  tibble::tibble(query_name = names(seqs), seq = as.character(seqs),
                 cluster_idx = seq_cluster_index) %>%
    dplyr::group_by(cluster_idx) %>%
    dplyr::mutate(n_cluster = dplyr::n()) %>% ungroup()
}

tbl = cdhit2(nnseq, identity = 1, only_index = FALSE, g=1, l=10, r=1)

I tested the -r 0 and -r 1 in the original program on unix and in there it seems to work.

I can see that cdhitestC is calling the compiled C version of your cdhitest (cdhit-est.cpp) and I thought that the code in my first post might have been related to the issue.

I know that running cdhitest in both strands is not related to your line of work, however, it would be very nice to be able to use cdhit in your package in other lines of work where clustering of unstranded data is needed.

Thanks for your reply.
Jose

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Option -r 1 in cdhit-est does not work #13

Option -r 1 in cdhit-est does not work #13

SamaZYX commented Oct 5, 2022

amcdavid commented Oct 6, 2022

SamaZYX commented Oct 7, 2022

Option -r 1 in cdhit-est does not work #13

Option -r 1 in cdhit-est does not work #13

Comments

SamaZYX commented Oct 5, 2022

amcdavid commented Oct 6, 2022

SamaZYX commented Oct 7, 2022