brca_tcga_pan_can_atlas_2018 data_sv.txt is not formatted correctly #1820

Open
mjsteinbaugh opened this issue May 3, 2023 · 4 comments

@mjsteinbaugh

Hi cBioPortal team,

I noticed a parsing issue with data_sv.txt from the brca_tcga_pan_can_atlas_2018 dataset.

Here's a reproducible example in R:

## using pipette package
## https://github.com/acidgenomics/r-pipette
con <- "https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt"
tsv <- pipette::import(con = con, format = "tsv", engine = "base")
→ Importing <https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt> using base::`read.table()`.
Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",  :
  more columns than column names
Calls: <Anonymous> ... import -> .local -> do.call -> do.call -> <Anonymous>

This file is not correctly formatted: the header row defines 13 column names, but the data rows contain 17 columns.
I'm happy to provide a pull request to fix this issue.

Best,
Mike

@mjsteinbaugh
Author

Here's a slightly more informative message from data.table's fread engine:

> tsv <- pipette::import(con = con, format = "tsv", engine = "data.table")
→ Importing <https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt> using data.table::`fread()`.
Warning in (function (input = "", file = NULL, text = NULL, cmd = NULL,  :
  Detected 13 column names but the data has 17 columns (i.e. invalid file). Added 4 extra default column names at the end.
Calls: <Anonymous> ... import -> .local -> do.call -> do.call -> <Anonymous>
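
The mismatch reported above can also be checked independently of any importer by counting tab-separated fields per line in base R. This is only a minimal sketch: `fieldCountSummary` is a hypothetical helper (not part of pipette or cBioPortalData), and the small temporary file is a toy stand-in for data_sv.txt, not the real data.

```r
## Hypothetical helper: report the number of tab-separated fields in the
## first line (assumed to be the header) and tabulate the counts for all
## remaining lines.
fieldCountSummary <- function(file) {
    n <- utils::count.fields(
        file = file,
        sep = "\t",
        quote = "",              # treat quote characters as ordinary text
        comment.char = "",       # do not drop lines starting with "#"
        blank.lines.skip = FALSE
    )
    list(
        "header" = n[[1L]],
        "data" = table(n[-1L])
    )
}

## Toy malformed file (assumption, for illustration only):
## 2 column names in the header, but 3 fields in each data row.
tmp <- tempfile(fileext = ".txt")
writeLines(
    text = c("col1\tcol2", "a\tb\tc", "d\te\tf"),
    con = tmp
)
out <- fieldCountSummary(tmp)
print(out)
```

If the header count differs from the counts tabulated for the data rows, the file has the same kind of problem that `read.table()` and `fread()` complain about. Running the helper on a local copy of data_sv.txt should show which field counts occur and how often.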

@mjsteinbaugh
Author

Here's a fixed version of the file that now works with cBioPortalData in Bioconductor:

data_sv.txt

@rmadupuri
Collaborator

rmadupuri commented May 9, 2023

Thanks for noticing that and sharing the fix, @mjsteinbaugh. We will update it in the portal.

@mjsteinbaugh
Author

@rmadupuri I noticed a few other files aren't formatted correctly. @LiNk-NY and I are putting together a list and I'll update the thread here. Thanks!!
