Input for the import-rna command #479

Open
SJRussell opened this issue Nov 15, 2019 · 4 comments

Comments

@SJRussell

  1. It's not clear what gene resource data to download from BioMart. I'm using the built-in hg38 gene info as a template, but BioMart doesn't have "NCBI gene ID" or "Transcript support level (TSL)" available for hg19. I'm going to try to get the original FASTA files for the project and remap to hg38, but it would be great to have a brief overview of what "Gene-resource" info is required when working with non-hg38 genomes.
  2. I'd also like to build my own cnv expression correlates, but the input requirements for cnv_expression_correlate.py are not clear. Could you point me to a resource for building these inputs?
  3. There are several issues/questions about using counts as input for import-rna. Have the issues been fixed, or should I run RSEM instead of HTSeq-count to generate my sample input?

Thanks for your great tool and in advance for your help.

etal added the rna label Nov 18, 2019
@etal
Owner

etal commented Nov 18, 2019

Thanks for checking in, and sorry for the trouble.

  1. Offhand I'm not sure what to do about BioMart's lack of support for hg19 etc. Do you know of another BioMart source that might have these, or a way to do it in R?
  2. At the moment the code is the spec, and I agree some proper docs are necessary here. (Tagging this ticket.) Some manual wrangling of the tables is needed.
  3. There's a chance it's been fixed -- if it's quick to check, then try it, otherwise RSEM would be a viable workaround.

@SJRussell
Author

Thanks for the response. For now:

  1. I've mapped to hg38 instead of trying to get the right output from BioMart. The difficulty was that I didn't know what fields were required for the gene-resource file.
  2. If you update the documentation to include details on how to build custom expression correlates, please mention it in this ticket. It seems to me that the accuracy of the algorithm depends on experiment-specific expression correlates.
  3. Upon installing CNVkit with conda and running any cnvkit.py commands, I got this error:
    Traceback (most recent call last):
      File "/home/stewart/anaconda3/envs/cnvkit/bin/cnvkit.py", line 8, in <module>
        from cnvlib import commands
      File "/home/stewart/cnvkit/cnvlib/__init__.py", line 4, in <module>
        from .cmdutil import read_cna as read
      File "/home/stewart/cnvkit/cnvlib/cmdutil.py", line 7, in <module>
        from .cnary import CopyNumArray as CNA
      File "/home/stewart/cnvkit/cnvlib/cnary.py", line 9, in <module>
        from . import core, descriptives, params, smoothing
      File "/home/stewart/cnvkit/cnvlib/smoothing.py", line 152
        x, wing, *padded = check_inputs(x, width, False, weights)
                 ^
    SyntaxError: invalid syntax
    Reinstalling with pip seems to have resolved the issue. I also ran with -f counts and it appears to give log2 values, suggesting that counts from STAR or HTSeq-count can be used.
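
    For context, the line flagged in the traceback uses extended iterable unpacking (`x, wing, *padded = ...`), which is Python 3 syntax (PEP 3132). A SyntaxError at that point is consistent with the script having been run by a Python 2 interpreter, although the thread does not confirm which interpreter the conda environment actually used. A minimal sketch for checking this from inside the environment, assuming only the standard library (the helper name `can_compile` and the sample tuple are hypothetical, not part of CNVkit):

```python
import sys


def can_compile(source):
    """Return True if this interpreter accepts `source` as valid syntax."""
    try:
        compile(source, "<syntax-check>", "exec")
        return True
    except SyntaxError:
        return False


# Same unpacking pattern as the failing line, with a stand-in right-hand side.
STAR_UNPACKING = "x, wing, *padded = (1, 2, 3, 4)"

if __name__ == "__main__":
    # Show which interpreter is actually running and whether it accepts the syntax.
    print("interpreter:", sys.executable)
    print("version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("accepts star-unpacking:", can_compile(STAR_UNPACKING))
```

    If the last line prints False, the active interpreter predates Python 3 and would fail on that line of smoothing.py in the same way.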

@etal
Owner

etal commented Nov 29, 2019

Thanks for the feedback. I'll roll another release for the sake of getting the latest fixes out to the world, and then see about replicating and documenting the process of creating the gene resource and cnv-expression correlates.

@SJRussell
Author

SJRussell commented Dec 6, 2019

Much appreciated.
Do you have any suggestions for cleaning up the calls I'm getting? So far I've tried specifying normal samples and using --no-txlen, --max-log2 2, and segment -m none. The attached PDFs were generated with 3 normal samples specified, using counts, and with the rest of the parameters at their defaults. The total input was 15 RNA-seq samples. As you can see, the XY normal sample segments are still quite variable. In the -16 samples, there is a clear decrease in log2 for chromosome 16. However, in the XO samples there is no clear decrease for the X chromosome (this could be due to dosage compensation or the fact that the population contains both XX and XY samples). Any suggestions on how to bring the baseline closer to 0 and reduce variability? Thanks!

normal1.pdf
normal2.pdf
normal3.pdf
XO-1.pdf
XO-2.pdf
XO-3.pdf
minus16-1.pdf
minus16-2.pdf
minus16-3.pdf
plus16-1.pdf
