In [1]:
library(tidyverse)
library(TCGAbiolinks)
library(HDF5Array)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: DelayedArray
Loading required package: stats4
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following object is masked from ‘package:dplyr’:

    count

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    co

# Constants

In [6]:
dirs <- rutils::get_dev_directories(dev_paths_file = "../dev_paths.txt")
projects <- c("TCGA-CESC", "TCGA-OV", "TCGA-UCS", "TCGA-UCEC")
# paste0() is vectorized, so no map()/unlist() round trip is needed.
project_paths <- paste0(dirs$data_dir, "/", projects)

# Functions

In [43]:
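# Build a GDC query for the protein expression data of one TCGA project.
# `legacy = TRUE` targets the GDC legacy archive (hg19 reference);
# newer TCGAbiolinks releases may no longer accept this argument.
protein_expr_query <- function(p) {
    GDCquery(
        project = p,
        data.category = "Protein expression",
        legacy = TRUE
    )
}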


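# Extract the participant ID (third "-"-separated field) from a single
# TCGA barcode. Expects one barcode at a time; apply it with map_chr().
get_participant_id <- function(barcode) {
    unlist(str_split(barcode, "-"))[3]
}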
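
A minimal sketch of a vectorized variant, in case it is useful to process a whole vector of barcodes in one call. The name `get_participant_ids` and the example barcodes are made up for illustration and are not part of the original notebook; only base R is assumed.

```r
# Hypothetical helper: vectorized counterpart of get_participant_id().
# Splits each barcode on "-" and keeps the third field (the participant ID).
get_participant_ids <- function(barcodes) {
    fields <- strsplit(barcodes, "-", fixed = TRUE)
    # f[3] is NA_character_ when a barcode has fewer than three fields.
    vapply(fields, function(f) f[3], character(1))
}

# Made-up barcodes that follow the TCGA field layout.
example_barcodes <- c(
    "TCGA-AB-1234-01A-11R-A123-07",
    "TCGA-CD-5678-01A-21-A456-20"
)
participant_ids <- get_participant_ids(example_barcodes)
print(participant_ids)
```

Returning a character vector (rather than a list) keeps later set operations such as `intersect()` straightforward.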

In [18]:
proj_idx <- 1
download_dir <- paste0(dirs$data_dir, "/tcga_biolinks_downloads")
q <- protein_expr_query(projects[proj_idx])
GDCdownload(q, method = "api", directory = download_dir, files.per.chunk = 10)
data <- GDCprepare(q, directory = download_dir)

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg19
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-CESC
--------------------
oo Filtering results
--------------------
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-CESC
GDCdownload will download 173 files. A total of 1.004929 MB
Downloading chunk 1 of 18 (10 files, size = 57.946 KB) as Mon_Aug_17_17_55_36_2020_0.tar.gz
Downloading chunk 2 of 18 (10 files, size = 58.403 KB) as Mon_Aug_17_17_55_36_2020_1.tar.gz
Downloading chunk 3 of 18 (10 files, size = 58.021 KB) as Mon_Aug_17_17_55_36_2020_2.tar.gz
Downloading chunk 4 of 18 (10 files, size = 58.121 KB) as Mon_Aug_17_17_55_36_2020_3.tar.gz
Downloading chunk 5 of 18 (10 files, size = 58.611 KB) as Mon_Aug_17_17_55_36_2020_4.tar.gz
Downloading chunk 6 of 18 (10 files, size = 59.24 KB) as Mon_Aug_17_17_55_36_2020_5.tar.gz
Downloading chunk 7 of 18 (10 files, size = 57.932 KB) as Mon_Aug_17_17_55_36_2020_6.tar.gz
Downloading chunk 8 of 18 (10 files, size = 58.55 KB) as Mon_Aug_17_17_55_36_2020_7.tar.gz
Downloading chunk 9 of 18 (10 files, size = 57.86 KB) as Mon_Aug_17_17_55_36_2020_8.tar.gz
Downloading chunk 10 of 18 (10 files, size = 57.794 KB) as Mon_Aug_17_17_55_36_2020_9.tar.gz
Downloading chunk 11 of 18 (10 files, size = 58.378 KB) as Mon_Aug_17_17_55_36_2020_10.tar.gz
Downloading chunk 12 of 18 (10 files, size = 57.766 KB) as Mon_Aug_17_17_55_36_2020_11.tar.gz
Downloading chunk 13 of 18 (10 files, size = 57.52 KB) as Mon_Aug_17_17_55_36_2020_12.tar.gz
Downloading chunk 14 of 18 (10 files, size = 57.869 KB) as Mon_Aug_17_17_55_36_2020_13.tar.gz
Downloading chunk 15 of 18 (10 files, size = 57.535 KB) as Mon_Aug_17_17_55_36_2020_14.tar.gz
Downloading chunk 16 of 18 (10 files, size = 57.838 KB) as Mon_Aug_17_17_55_36_2020_15.tar.gz
Downloading chunk 17 of 18 (10 files, size = 57.943 KB) as Mon_Aug_17_17_55_36_2020_16.tar.gz
Downloading chunk 18 of 18 (3 files, size = 17.602 KB) as Mon_Aug_17_17_55_36_2020_17.tar.gz
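
The chunk counts in the log follow directly from the file count and the `files.per.chunk` setting. A small sketch of that arithmetic, using the 173 files and chunk size of 10 reported above (the variable names are illustrative only):

```r
# Values taken from the download log and the GDCdownload() call.
n_files <- 173
files_per_chunk <- 10

# Number of chunks: every full chunk plus one partial chunk for the remainder.
n_chunks <- ceiling(n_files / files_per_chunk)

# Files left over for the final chunk.
last_chunk_size <- n_files - (n_chunks - 1) * files_per_chunk

print(c(n_chunks = n_chunks, last_chunk_size = last_chunk_size))
```

This gives 18 chunks with 3 files in the last one, which matches the final "chunk 18 of 18 (3 files ...)" line.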




In [20]:
coldata_df <- read_tsv(paste0(dirs$data_dir, "/unified_cervical_data/coldata.tsv"))

Parsed with column specification:
cols(
  sample_name = col_character(),
  condition = col_character(),
  data_source = col_character()
)


In [52]:
RNA_sample_names <- coldata_df %>%
    dplyr::filter(data_source == "TCGA") %>%
    dplyr::pull(sample_name)

In [33]:
colnames(data[, -1])[1:10]

In [53]:
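# map_chr() returns character vectors rather than lists, which keeps the
# set operations below type-stable.
RNA_participants <- map_chr(RNA_sample_names, get_participant_id)
protein_participants <- map_chr(colnames(data[, -1]), get_participant_id)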

In [58]:
length(intersect(RNA_participants, protein_participants))
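
Note that `intersect()` returns unique values, so the count above is the number of distinct shared participants. As a rough sketch of how one might also list the participants that appear in only one data type, here is a toy example with made-up IDs (`rna_ids` and `protein_ids` are illustrative stand-ins, not the notebook's data):

```r
# Toy participant IDs; "C3" is duplicated on purpose.
rna_ids <- c("A1", "B2", "C3", "C3")
protein_ids <- c("B2", "C3", "D4")

# Participants with both RNA and protein data (duplicates collapse).
shared <- intersect(rna_ids, protein_ids)

# Participants present in only one of the two data types.
rna_only <- setdiff(rna_ids, protein_ids)
protein_only <- setdiff(protein_ids, rna_ids)

print(list(shared = shared, rna_only = rna_only, protein_only = protein_only))
```

The same three calls could be applied to `RNA_participants` and `protein_participants` to see which samples would be dropped when pairing the two data sets.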