STARsolo output formatting issues #556

Closed
vpresnyak opened this issue Feb 4, 2019 · 5 comments
Labels
issue: code Likely to be an issue with STAR code

Comments

@vpresnyak
Contributor

Hi @alexdobin,

I'm testing STARsolo for our single-cell use and have run into some issues using the output for downstream analysis. The formatting of the files does not match that of the Cellranger outputs and trips up tools that expect those files as input (e.g. scanpy).

The formatting of genes.tsv seems unusual - it looks like the information alternates between columns. This isn't the case in Cellranger output files:

$ cat ./Solo.out/genes.tsv | head 
ENSG00000223972.5       DDX11L1
WASH7P  ENSG00000278267.1
ENSG00000243485.5       MIR1302-2HG
MIR1302-2       ENSG00000237613.2
ENSG00000268020.3       AL627309.6
OR4G11P ENSG00000186092.5
ENSG00000238009.6       AL627309.1
AL627309.3      ENSG00000233750.3
ENSG00000268903.1       AL627309.7
AL627309.8      ENSG00000239906.1
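
Editor's note: the alternation described above can be checked mechanically. The following is a minimal sketch, assuming gene IDs follow the Ensembl/GENCODE pattern seen in the output (e.g. ENSG00000223972.5); the function name `find_misordered_rows` is hypothetical and not part of STAR or Cell Ranger. It only detects rows whose first column is not a gene ID. It does not attempt a repair, because it is not clear from the output alone whether the swapped rows still pair each ID with its own gene name.

```python
import re

# Assumption: gene IDs follow the Ensembl/GENCODE pattern, e.g. "ENSG00000223972.5".
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")


def find_misordered_rows(lines):
    """Return 1-based indices of rows whose first column is not an Ensembl gene ID."""
    bad = []
    for i, line in enumerate(lines, start=1):
        fields = line.split()
        # A well-formed row has exactly two fields: gene ID, then gene name.
        if len(fields) != 2 or not ENSEMBL_GENE_ID.match(fields[0]):
            bad.append(i)
    return bad


# First four rows of the genes.tsv output quoted above.
sample = [
    "ENSG00000223972.5       DDX11L1",
    "WASH7P  ENSG00000278267.1",
    "ENSG00000243485.5       MIR1302-2HG",
    "MIR1302-2       ENSG00000237613.2",
]

misordered = find_misordered_rows(sample)
print(misordered)  # [2, 4]
```

On a real file, the same function could be applied to the lines read from ./Solo.out/genes.tsv; an empty result would indicate that every row starts with a gene ID.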

As a separate issue, the header of the matrix.mtx file is incomplete: the first line is a bare %% with no format declaration. The declaration is required by the file type definition (https://math.nist.gov/MatrixMarket/formats.html) and is checked by tools that read the file (Cell Ranger output provided for comparison).

$ cat ./Solo.out/matrix.mtx | head
%%
%
58347 737280 6489398
17075   1       1
22155   1       1
40733   1       1
44903   1       1
48401   1       1
50993   1       2
55467   8       1

$ gunzip -c cell_ranger/outs/raw_feature_bc_matrix/matrix.mtx.gz | head
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.0.1"}
33538 737280 6251612
9817 1 1
10931 1 1
12597 1 1
24283 1 1
26812 1 1
28909 1 1
31584 1 2
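
Editor's note: until a fixed release is available, the missing banner could be patched in after the fact. The following is a rough sketch rather than an official workaround. It assumes the banner text from the Cell Ranger output quoted above is also appropriate for the STARsolo matrix (integer counts in coordinate format); the function names are hypothetical.

```python
# Banner copied from the Cell Ranger output quoted above.
# Assumption: the STARsolo matrix is also an integer coordinate matrix.
EXPECTED_BANNER = "%%MatrixMarket matrix coordinate integer general"


def has_matrixmarket_banner(first_line):
    """True if the first line declares the Matrix Market format."""
    return first_line.startswith("%%MatrixMarket ")


def with_banner(lines):
    """Return a copy of lines in which a bare '%%' first line is replaced by the banner."""
    if not lines:
        raise ValueError("empty matrix file")
    if has_matrixmarket_banner(lines[0]):
        return list(lines)  # already well-formed, nothing to do
    if lines[0].strip() == "%%":
        return [EXPECTED_BANNER] + list(lines[1:])
    raise ValueError(f"unexpected first line: {lines[0]!r}")


# First lines of the STARsolo matrix.mtx quoted above.
star_lines = ["%%", "%", "58347 737280 6489398", "17075   1       1"]

patched = with_banner(star_lines)
print(patched[0])
```

The remaining lines, including the bare % comment line and the dimensions line, are left untouched, since Matrix Market readers skip comment lines that start with %.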

I'm not sure if I'm doing something funny at the command line; here's how I'm running it:

STAR --soloType Droplet --runThreadN 64 --soloCBwhitelist /scratch/cellranger-2.1.1/cellranger-cs/2.1.1/lib/python/cellranger/barcodes/737K-august-2016.txt --outFileNamePrefix . --genomeDir /scratch/temp_genome --readFilesCommand gunzip -c --readFilesIn ...

@alexdobin
Owner

Hi @vpresnyak

Have you regenerated the genome index with version 2.7.0a?
STAR is supposed to throw an error with an old genome, but maybe that did not work.

There is a pull request for the matrix header change; I will make a patch tonight.

Cheers
Alex

@vpresnyak
Contributor Author

vpresnyak commented Feb 4, 2019

The genome reference was generated fresh from the GRCh38 gencode files. I did try running it with an older genome, which threw an error as it's supposed to.

@alexdobin
Owner

Thanks!

I think I found the problem, please try the patch I just uploaded to GitHub master.
I will make a tagged release tomorrow, hopefully.

@bgruening

bgruening commented Apr 4, 2019

We think this can be closed, can't it?

@alexdobin
Owner

It should be working, but I did not hear back from the OP.
Let's close it.
