# Generated TCGA Matrix Creation Scripts
```
Andrew Davidson
aedavids@ucsc.edu
4/27/22
```
ref: extraCellularRNA/terra/jupyterNotebooks/createFailedSampleDataSet-TCGA.{ipynb,html}

The salmonTarQuantWorkflow.wdl was run for each of the Terra TCGA workspaces. The next part of our research requires that all the salmon counts be gathered into a single matrix. We are unable to do this using Terra. Our workaround is to copy the quant files to a GCP native project and use Apache Spark to create the matrices.

This notebook creates
1. The gsutil scripts required to transfer the Terra workspace files to the GCP native project. The native project cannot access the workspace files directly; however, gsutil is able to copy them.
2. The corresponding colData.csv files for each workspace. This is the metadata needed for future processing.
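
As a rough illustration of item 1, the sketch below builds one `gsutil cp` command per sample that has a quant file. It is not the notebook's actual script generator: the helper name `makeGsutilCopyCommands`, the `sample_id` column, the toy `gs://fc-example/...` URLs, and the destination layout (bucket/workspace/sample) are all assumptions made for the example.

```python
import pandas as pd

def makeGsutilCopyCommands(sampleDF, quantFileColName, destinationBucketId,
                           workspaceName, sampleIdColName="sample_id"):
    '''
    hypothetical helper: returns a list of 'gsutil cp' command strings,
    one for each row that has a value in the quant file column
    '''
    hasQuantFile = sampleDF[quantFileColName].notna()
    rows = sampleDF.loc[hasQuantFile, [sampleIdColName, quantFileColName]]

    commands = []
    for sampleId, quantURL in rows.itertuples(index=False, name=None):
        # assumed layout: one directory per sample so that files with the
        # same base name do not overwrite each other
        destination = "gs://{}/{}/{}/".format(destinationBucketId, workspaceName, sampleId)
        commands.append("gsutil cp {} {}".format(quantURL, destination))

    return commands

# toy data; the URLs and column names are made up for illustration
toyDF = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],
    "quantFile3": ["gs://fc-example/s1/quant.sf.gz", None, "gs://fc-example/s3/quant.sf.gz"],
})

commands = makeGsutilCopyCommands(toyDF, "quantFile3", "AEDWIP_BUCKET_ID", "toyWorkspace")
for cmd in commands:
    print(cmd)
```

Writing the commands to a shell script, rather than running them from the notebook, keeps the transfer step separate so it can be run from any machine with access to both buckets.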

createFailedSampleDataSet-TCGA.{ipynb,html} demonstrates how to identify the samples we want to construct our matrices from.

In [1]:
from datetime import datetime
now = datetime.now()
today = now.strftime('%Y-%m-%d')
currentTime = now.strftime('%H:%M:%S')
print("run on {}".format( today +  " " + currentTime ))

import numpy as np
import pandas as pd

run on 2022-04-28 17:14:18


In [2]:
# define the bucket in the native gcp project we need to copy the files to 
DESTINATION_BUCKET_ID = "AEDWIP_BUCKET_ID"

In [3]:
# backups of terra data models are stored in a separate repo
# so that branch merges do not lose data model versions
rootDir = "../../../terraDataModels/test-aedavids-proj/TCGA"
listOfWorkSpacePath = rootDir + "/" + "listOfWorkSpaces.csv" 
workspaceDF = pd.read_csv( listOfWorkSpacePath )
workSpaceNamesList = workspaceDF.loc[:, "wokspace"].to_list()

# Find column with results generated by salmonTarQuantWorkflow v 4
This workflow uses the salmon paired read bug fix. The column name is not always the same because there are 33 different workspaces, and I had to name the output column manually for each run. For unknown reasons, Terra would not let me define these values in JSON.

In [4]:
def readDataModel( rootDir, workspaceName, entityName ) :
    '''
    entity refers to one of the terra data model tsv files, for example 'sample'
    '''
    dataModelTSV = rootDir + "/" + workspaceName + "/" + entityName + ".tsv"
    dataModelDF = pd.read_csv(dataModelTSV, delimiter='\t')
    return dataModelDF

In [5]:
def findQuantFileColumnName(workSpaceNamesList) :
    '''
    loads most of the tcga sample tsv files. See code for workspaces that were skipped.
    they were skipped because not all the expected samples ran.
    
    returns a dictionary.
        key is the workspaceName
        value is (quantColName, df)
    '''
    dataDict = dict()

    # we need to dig into these workspaces to figure out why the runs failed
    # check out the minimap and star wdls for a good example of how to
    # work with single end, paired end and multiple replicate fastq files
    # there is a good chance that is the source of our bugs
    missingList = ['TCGA_DLBC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                   ,'TCGA_GBM_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                   ,'TCGA_LAML_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                   ,'TCGA_SARC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                  ]

    for workspaceName in workSpaceNamesList:
        if workspaceName in missingList:
            continue

        #print(workspaceName)
        df = readDataModel( rootDir, workspaceName, entityName = "sample" )
        quantMatchlist = [s for s in df.columns if "quantF" in s]
        colName = [s for s in quantMatchlist if "3" in s][0]
        #print(colName)
        #print()
        dataDict[workspaceName] = (colName, df) 
        
    return dataDict
        
dataDict = findQuantFileColumnName(workSpaceNamesList)        
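
The column lookup above takes the first column whose name contains both "quantF" and "3", and raises a bare IndexError if there is no match. A stricter variant is sketched below, on the assumption that exactly one matching column is expected per workspace. The helper name `pickQuantColumn` and the toy column names are hypothetical.

```python
import pandas as pd

def pickQuantColumn(df, prefix="quantF", version="3"):
    '''
    hypothetical helper: returns the single column name that contains
    both prefix and version, and raises ValueError otherwise
    '''
    candidates = [c for c in df.columns if prefix in c and version in c]
    if len(candidates) != 1:
        # unlike the [0] lookup, report what was found instead of
        # failing with an IndexError or silently taking the first match
        raise ValueError("expected 1 quant column, found: {}".format(candidates))
    return candidates[0]

# toy data model with two quant columns from different runs
toyDF = pd.DataFrame(columns=["sample_id", "quantFile", "quantFile3"])
quantColName = pickQuantColumn(toyDF)
print(quantColName)

# a data model without a matching column produces a descriptive error
try:
    pickQuantColumn(pd.DataFrame(columns=["sample_id"]))
    missingColumnRaised = False
except ValueError as e:
    missingColumnRaised = True
    print(e)
```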

## Find samples with quant files, and samples that failed and need to be re-run

In [6]:
def findSamplesWithMissingQuantFile(rootDir, workspaceName, quantFileColName):
    '''
    returns a data frame of failed samples
    '''
    sampleDF = readDataModel( rootDir, workspaceName, entityName = "sample" )     
    print("\nworkspace: {}".format(workspaceName))
    
    mRNARowsLogicalPS = sampleDF['mRNASeq_fastq_path'].notna()
    nunMRNAFiles = sum( mRNARowsLogicalPS )
    print("number of mRNASeq_fastq_path files: {}".format(nunMRNAFiles))
    mRNA_DF = sampleDF.loc[mRNARowsLogicalPS,:]

    # find rows that have fastq files but are missing results. ie 'quantFilePaired' value
    passedSamplesLogicalPS = mRNA_DF[quantFileColName].notna()
    numPassed = sum(passedSamplesLogicalPS) 
    
    passedSamplesDF = mRNA_DF.loc[passedSamplesLogicalPS,:]
    
    failedSampleLogicalPS =  mRNA_DF[quantFileColName].isna()
    failedSampleDF = mRNA_DF.loc[failedSampleLogicalPS,:]
    
    numFailed = nunMRNAFiles - numPassed
   
    print("num passed:{}".format(numPassed))
    print("num failed:{}".format(numFailed))
    
    return failedSampleDF

In [7]:
def findSamplesWithQuantFile(sampleDF, quantFileColName):   
    '''
    returns a data frame with all the samples that have quant files
    '''
    mRNARowsLogicalPS = sampleDF['mRNASeq_fastq_path'].notna()
    nunMRNAFiles = sum( mRNARowsLogicalPS )
    print("number of mRNASeq_fastq_path files: {}".format(nunMRNAFiles))
    mRNA_DF = sampleDF.loc[mRNARowsLogicalPS,:]

    # find rows that have fastq files and also have results, i.e. a value in the quant file column
    passedSamplesLogicalPS = mRNA_DF[quantFileColName].notna()
    numPassed = sum(passedSamplesLogicalPS) 
    
    passedSamplesDF = mRNA_DF.loc[passedSamplesLogicalPS,:]
    
    failedSampleLogicalPS =  mRNA_DF[quantFileColName].isna()
    failedSampleDF = mRNA_DF.loc[failedSampleLogicalPS,:]
    
    numFailed = nunMRNAFiles - numPassed
   
    print("num passed:{}".format(numPassed))
    print("num failed:{}".format(numFailed))
    
    return passedSamplesDF

In [8]:
# test data set
#
# workspace: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
# number of mRNASeq_fastq_path files: 177
# num passed:105
# num failed:72
# failedSampleDF.shape:(72, 63)
# quantFile3

def testfindSamplesWithQuantFiles():
    workspaceName = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
    quantColName, sampleDF = dataDict[workspaceName]
    print("workspaceName: {} quantColName: {}".format(workspaceName, quantColName))

    quantDF = findSamplesWithQuantFile(sampleDF, quantColName)
    print("quantDF.shape:{}".format(quantDF.shape))
    
testfindSamplesWithQuantFiles()

workspaceName: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab quantColName: quantFile3
number of mRNASeq_fastq_path files: 177
num passed:105
num failed:72
quantDF.shape:(105, 63)
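
findSamplesWithMissingQuantFile() and findSamplesWithQuantFile() share almost all of their code and differ only in which data frame they return. One possible consolidation, sketched below, is a single helper that returns both the passed and the failed samples. The name `splitByQuantFile` and the toy data are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def splitByQuantFile(sampleDF, quantFileColName, fastqColName="mRNASeq_fastq_path"):
    '''
    hypothetical helper: returns (passedSamplesDF, failedSampleDF)
    only rows that have a fastq file are considered
    '''
    mRNA_DF = sampleDF.loc[sampleDF[fastqColName].notna(), :]

    # passed samples have a quant file, failed samples do not
    passedSamplesLogicalPS = mRNA_DF[quantFileColName].notna()
    passedSamplesDF = mRNA_DF.loc[passedSamplesLogicalPS, :]
    failedSampleDF = mRNA_DF.loc[~passedSamplesLogicalPS, :]

    return (passedSamplesDF, failedSampleDF)

# toy data: s4 has no fastq file, s2 has a fastq file but no quant file
toyDF = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "mRNASeq_fastq_path": ["a.tar", "b.tar", "c.tar", None],
    "quantFile3": ["a.sf", None, "c.sf", None],
})

passedDF, failedDF = splitByQuantFile(toyDF, "quantFile3")
print("num passed:{}".format(passedDF.shape[0]))
print("num failed:{}".format(failedDF.shape[0]))
```

With this shape the passed and failed counts always sum to the number of rows that have a fastq file, which matches the test output above (105 + 72 = 177).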
