# HNSCC Clinical Data

## Download and Parse TCGA Clinical Data Workflow

### McWeeney Lab, Oregon Health & Science University

**Author: Gabrielle Choonoo (choonoo@ohsu.edu)**

## Introduction

This is a step-by-step workflow for downloading TCGA clinical XML files and parsing them into a single tab-delimited file.

Required Files:
* Manifest File (.txt)
* Clinical Data (folders containing .XML files)
* This notebook (HNSCC_Clinical_Data_Workflow.ipynb): [[Download here]](https://raw.githubusercontent.com/gchoonoo/HNSCC_Clinical_Data_Notebook/master/HNSCC_Clinical_Data_Workflow.ipynb)

Required Python packages (all part of the standard library):
- `xml.etree.ElementTree`
- `os`
- `csv`

**Note:** This notebook can also be downloaded as separate Python and R scripts (only the code blocks shown below are included):
* [[Download python script here]](https://raw.githubusercontent.com/gchoonoo/HNSCC_Clinical_Data_Notebook/master/parse_clinical_xml.py)
* [[Download R script here]](https://raw.githubusercontent.com/gchoonoo/HNSCC_Clinical_Data_Notebook/master/parse_clinical_xml_part2.r)

**All code is available on GitHub: [https://github.com/gchoonoo/HNSCC_Clinical_Data_Notebook](https://github.com/gchoonoo/HNSCC_Clinical_Data_Notebook)**

## Download Manifest File

### Navigate to the GDC Data Portal (which hosts the TCGA data)

https://portal.gdc.cancer.gov/

Click on "Data"

Select "Head and Neck" in the primary site section on the left

Select "Clinical Supplement" under the File Counts by Data Type widget

Click "Add all files to cart". As of 5/9/17 there should be 528 files, one XML file per patient.

Click on your cart.

Click "Download" and select "Manifest" from the pull-down menu.

This downloads a text file listing the files you want to retrieve with the GDC Data Transfer Tool (gdc-client).
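Before downloading, it can help to confirm that the manifest lists the expected number of files. The sketch below assumes the manifest is tab-delimited with a header row that includes `id` and `filename` columns; that layout is an assumption rather than something stated above, and `read_manifest` is a hypothetical helper name.

```python
import csv

def read_manifest(path):
    """Return the manifest rows as a list of dicts keyed by the header names.

    Assumes a tab-delimited file whose first line is a header row.
    """
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))

# Example usage (replace the path with your own manifest file):
# rows = read_manifest("gdc_manifest_20170509_203418.txt")
# print(len(rows))  # number of files listed in the manifest
```

For the 5/9/17 download described above, the row count should match the 528 files added to the cart.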

## Download the XML Files

Download the GDC Data Transfer Tool (gdc-client):

https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

Run this command in a terminal to download the files (replace the path with the location of your manifest file):

gdc-client download -m /Users/choonoo/HNSCC_Clinical_Data_Notebook/gdc_manifest_20170509_203418.txt

Once the downloads finish, move the downloaded folders into a new folder. They are then ready to be parsed in Python.
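A quick check can confirm that nothing is missing once the download finishes. This sketch assumes each file was saved as `<download_dir>/<id>/<filename>`, using the `id` and `filename` columns of a tab-delimited manifest. Both the folder layout and the helper name `missing_downloads` are assumptions to adapt to your own setup.

```python
import csv
import os

def missing_downloads(manifest_path, download_dir):
    """Return manifest filenames that are not found under download_dir/<id>/.

    Assumes the manifest has 'id' and 'filename' columns and that each file
    sits in a subfolder named after its id (an assumed layout).
    """
    missing = []
    with open(manifest_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            expected = os.path.join(download_dir, row["id"], row["filename"])
            if not os.path.isfile(expected):
                missing.append(row["filename"])
    return missing
```

An empty list means every file named in the manifest was found.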

## Run Python script to parse XML files

In [None]:
"""
Import libraries
"""
import xml.etree.ElementTree as et

import os

import csv

In [None]:
"""
Set the directory that holds your downloaded files
"""

rootdir = "/Users/choonoo/clinical_data_5_9_17"

os.chdir(rootdir)

"""
Walk through the directory and collect the paths of all nested files
"""

dir_list = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        clin_file = os.path.join(subdir, file)
        print(clin_file)
        dir_list.append(clin_file)

"""
Keep only the XML file for each patient
"""

dir_list2 = [path for path in dir_list if 'parcel' not in path]

dir_list3 = [path for path in dir_list2 if path.lower().endswith('.xml')]

In [None]:
"""
Loop through each XML file and extract every nested variable within it
"""
for i in range(len(dir_list3)):
    new_list = []
    iter_et = et.iterparse(dir_list3[i], events=['start', 'end'])
    # The first event is the start of the root element
    event, root = next(iter_et)
    for event, element in iter_et:
        if event == "end" and element.tag != root.tag:
            print(element.tag + ":", element.text)
            result = element.tag + ":", element.text
            new_list.append(result)
            element.clear()

    root.clear()

    # Write each parsed XML file to a tab-delimited text file in a new folder.
    # This step belongs inside the loop; the output folder must already exist.
    with open("/Users/choonoo/clinical_data_5_9_17_parsed/file" + str(i) + ".txt", 'w', newline='') as fh:
        w = csv.writer(fh, delimiter='\t')
        w.writerows(new_list[1:])
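The tags written out by the parsing loop carry an XML namespace prefix in the form `{namespace}tag`, which the R code in the next section strips. As a minimal sketch, the same tag and text extraction can be done in Python with the namespace removed up front. `strip_namespace` and `parse_clinical_xml` are hypothetical helpers, the example XML and barcode are made up, and keeping only leaf elements (those without children) is a simplification of the loop above.

```python
import io
import xml.etree.ElementTree as et

def strip_namespace(tag):
    """Turn '{namespace}tag' into 'tag'; tags without a namespace are unchanged."""
    return tag.split("}", 1)[-1]

def parse_clinical_xml(source):
    """Return (tag, text) pairs for every leaf element in an XML file or file object."""
    pairs = []
    for element in et.parse(source).iter():
        if len(element) == 0:  # leaf element: no child elements
            text = (element.text or "").strip()
            pairs.append((strip_namespace(element.tag), text))
    return pairs

# Made-up example document, for illustration only
example = io.BytesIO(
    b'<root xmlns:shared="http://example.org/shared">'
    b"<patient><shared:bcr_patient_barcode>TCGA-XX-0000</shared:bcr_patient_barcode>"
    b"<shared:gender>MALE</shared:gender></patient></root>"
)
print(parse_clinical_xml(example))
```

Skipping container elements also avoids the whitespace-only values (text containing only a newline and indentation) that are otherwise cleaned up later.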

## Run R script to process text files

In [None]:
# clean NA function
clean_na = function(data_set){
  
  for (i in 1:dim(data_set)[2]){
    print(i)
    if(sum(na.omit(data_set[,i] == "")) > 0){
      
      data_set[which( data_set[,i] == ""),i] <- NA
    }
    
  }
  return(data_set)
}

In [None]:
# list the parsed text files (one per patient)
dir_folder = dir("clinical_data_5_9_17_parsed")

# create an empty list to hold the parsed data for each patient
pat_data <- vector("list",length(dir_folder))

# create an empty list to hold the variable names found in each file
pat_data_names <- vector("list",length(dir_folder))

# read in the data
for(i in 1:length(dir_folder)){
  
  print(i)
  
  read.delim(paste("clinical_data_5_9_17_parsed", dir_folder[i],sep="/"), header=F, sep="\t") -> pat_data[[i]]
  
  #pat_data_v2 = pat_data[[1]]
  
  pat_data[[i]][,1] <- gsub(":","",sapply(strsplit(as.character(pat_data[[i]][,1]), "}"),"[",2))
  
  
  
  # make duplicated variable names unique by appending a running index (1, 2, ...)
  if(sum(duplicated(pat_data[[i]][,1])) > 0){
    
    duplicates = unique(pat_data[[i]][duplicated(pat_data[[i]][,1]),1])
    
    
    for(a in 1:length(duplicates)){
      #print(i)
      pat_data[[i]][pat_data[[i]][,1] %in% duplicates[a],1] <- paste0(pat_data[[i]][pat_data[[i]][,1] %in% duplicates[a],1], seq(1:length(pat_data[[i]][pat_data[[i]][,1] %in% duplicates[a],1])))
      
    }
    
  }
  
  pat_data_v2 = pat_data[[i]]
  
  t(pat_data_v2) -> pat_data_v3
  
  as.data.frame(pat_data_v3) -> pat_data_v3
  
  names(pat_data_v3) <- pat_data_v2[,1]
  
  pat_data_v3[-1,] -> pat_data_v4
  
  print(names(pat_data_v4))
  
  row.names(pat_data_v4) <- pat_data_v4[,"bcr_patient_barcode"]
  
  pat_data[[i]] <- pat_data_v4
  
  pat_data_names[[i]] <- names(pat_data_v4)
  
}
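For readers who prefer Python, the duplicate-handling rule in the R loop above can be sketched as follows: every occurrence of a variable name that appears more than once gets a running index appended, while names that appear once are left alone. `make_unique` is a hypothetical helper, and the variable names in the example are illustrative only.

```python
from collections import Counter

def make_unique(names):
    """Append 1, 2, ... to every occurrence of a name that appears more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how many occurrences have been numbered so far
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append(name + str(seen[name]))
        else:
            unique.append(name)
    return unique

print(make_unique(["gender", "drug_name", "race", "drug_name"]))
# prints ['gender', 'drug_name1', 'race', 'drug_name2']
```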

In [None]:
# align the columns of every file to the union of variables across all files (168 variables in total)

unique(unlist(pat_data_names)) -> full_names

for(a in 1:length(pat_data)){
  
  print(a)
  
  if(sum(!full_names %in% names(pat_data[[a]])) > 0){
    
    for(i in full_names[!full_names %in% names(pat_data[[a]])]){
      
      pat_data[[a]][,i] <- NA
      
    }
  }
  
  pat_data[[a]] <- pat_data[[a]][,full_names]
  
}

# rbind all files
clinical_data_full = do.call(rbind,pat_data) 
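The alignment step above (take the union of variable names, add missing columns as NA, reorder, then stack) can be sketched in Python as well. This assumes each patient is represented as a dict mapping variable name to value. `align_records` is a hypothetical helper, and `None` stands in for R's `NA`.

```python
def align_records(records):
    """Give every record the same columns, filling gaps with None.

    Columns are the union of all keys, in the order they are first seen.
    Returns (column_names, rows) with one row per record.
    """
    full_names = []
    for record in records:
        for name in record:
            if name not in full_names:
                full_names.append(name)
    rows = [[record.get(name) for name in full_names] for record in records]
    return full_names, rows

# Illustrative records with different sets of variables
names, rows = align_records([
    {"bcr_patient_barcode": "P1", "gender": "MALE"},
    {"bcr_patient_barcode": "P2", "race": "WHITE"},
])
print(names)
print(rows)
```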

In [None]:
# set values that contain a newline ("\n") to NA
for(i in 1:ncol(clinical_data_full)){
  if(length(grep("\n",clinical_data_full[,i])) > 0){
    
    if(length(levels(clinical_data_full[,i])) > 0){
      levels(clinical_data_full[,i])[grep("\n", levels(clinical_data_full[,i]))] <- NA
      
    }else{
      clinical_data_full[grep("\n",clinical_data_full[,i]),i] <- NA
    }
    
  }
}

# clean blanks to NA
clean_na(clinical_data_full) -> clinical_data_full_v2

# save clinical data
write.table(file="clinical_tcga_data_5_9_17.txt", x=clinical_data_full_v2, sep="\t", quote=F, row.names=F)
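The two cleaning rules applied above, treating values that contain a newline and blank values as missing, can be expressed as a single Python function. This is a sketch under the assumption that values are strings or `None`. `clean_value` is a hypothetical name, and `None` plays the role of `NA`.

```python
def clean_value(value):
    """Return None for missing, blank, or newline-containing values; otherwise the value."""
    if value is None or value == "" or "\n" in value:
        return None
    return value

# Illustrative row: a real value, a blank, and whitespace-only container text
row = ["MALE", "", "\n        "]
print([clean_value(v) for v in row])
# prints ['MALE', None, None]
```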