## Query TCGA to build a download manifest for DLBC RNA-seq count data

This step queries the NIH Genomic Data Commons (GDC) API to identify RNA-seq gene expression files from the TCGA Diffuse Large B-Cell Lymphoma (DLBC) project and retrieve their metadata. The query is restricted to transcriptome profiling data produced by the STAR - Counts workflow, so the resulting files contain raw gene-level counts suitable for downstream expression analysis.

The API request specifies several filters, including the TCGA-DLBC project identifier, the data category and type corresponding to gene expression quantification, and the STAR - Counts workflow. Metadata fields such as file identifiers, filenames, checksums, and file sizes are requested so that the files can be reliably downloaded and verified.

The returned JSON response is converted into a tabular format using pandas, and a GDC-compatible manifest file is constructed containing the minimum required columns for bulk download. This manifest is saved as a tab-separated file and is later used by the GDC Data Transfer Tool to download the raw RNA-seq count files in a reproducible manner.

In [3]:
import json
import requests
import pandas as pd

FILES = "https://api.gdc.cancer.gov/files"

filters = {
  "op": "and",
  "content": [
    {"op":"in","content":{"field":"cases.project.project_id","value":["TCGA-DLBC"]}},
    {"op":"in","content":{"field":"data_category","value":["Transcriptome Profiling"]}},
    {"op":"in","content":{"field":"data_type","value":["Gene Expression Quantification"]}},
    {"op":"in","content":{"field":"analysis.workflow_type","value":["STAR - Counts"]}}
  ]
}

params = {
  "filters": json.dumps(filters),
  "fields": ",".join([
    "file_id","file_name","md5sum","file_size",
    "cases.submitter_id","cases.samples.sample_type",
    "analysis.workflow_type"
  ]),
  "format": "JSON",
  "size": "2000"
}

r = requests.get(FILES, params=params, timeout=60)
r.raise_for_status()
hits = r.json()["data"]["hits"]

df = pd.json_normalize(hits)

# Build a GDC manifest (minimum required columns)
manifest = df[["file_id","file_name","md5sum","file_size"]].copy()
manifest.columns = ["id","filename","md5","size"]

# Use one path variable so the reported location matches where the file is written
manifest_path = "../manifests/tcga_dlbc_star_counts_manifest.tsv"
manifest.to_csv(manifest_path, sep="\t", index=False)

len(manifest), manifest_path


(48, '../manifests/tcga_dlbc_star_counts_manifest.tsv')
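
The text above notes that checksums and file sizes are requested so the downloads can be verified. A minimal sketch of that check follows, using only the manifest columns built above (id, filename, md5, size). The helper names `md5_of_file` and `verify_downloads` are hypothetical, and the directory layout `<download_dir>/<id>/<filename>` is an assumption about how the files were downloaded; adjust the path construction if your layout differs.

```python
import hashlib
from pathlib import Path

import pandas as pd


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(manifest, download_dir):
    """Check each manifest entry against the file on disk.

    Assumes files live at <download_dir>/<id>/<filename> (hypothetical
    layout for this sketch). Returns one row per manifest entry with a
    `status` of ok, missing, size_mismatch, or md5_mismatch.
    """
    rows = []
    for entry in manifest.itertuples(index=False):
        path = Path(download_dir) / entry.id / entry.filename
        if not path.is_file():
            status = "missing"
        elif path.stat().st_size != int(entry.size):
            # Cheap check first: a wrong size means the hash cannot match.
            status = "size_mismatch"
        elif md5_of_file(path) != entry.md5:
            status = "md5_mismatch"
        else:
            status = "ok"
        rows.append({"id": entry.id, "filename": entry.filename, "status": status})
    return pd.DataFrame(rows, columns=["id", "filename", "status"])
```

To use it, reload the saved manifest with `pd.read_csv(manifest_path, sep="\t")` and pass it with the download directory to `verify_downloads`. Any row whose status is not `ok` points to a file that should be downloaded again.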