# Generate TCGA Matrix Creation Scripts
```
Andrew Davidson
aedavids@ucsc.edu
4/27/22
```
ref: extraCellularRNA/terra/jupyterNotebooks/createFailedSampleDataSet-TCGA.{ipynb,html}

The salmonTarQuantWorkflow.wdl was run for each of the Terra TCGA workspaces. The next part of our research requires that all the salmon counts be gathered into a single matrix. We are unable to do this using Terra. Our workaround is to copy the quant files to a GCP native project and use Apache Spark to create the matrices.

This notebook creates
1. The gsutil scripts required to transfer the Terra workspace files to the GCP native project. The native project cannot access the workspace files directly; however, gsutil is able to copy them.
2. The corresponding colData.csv files for each workspace. This is the metadata needed for future processing.

createFailedSampleDataSet-TCGA.{ipynb,html} demonstrates how to identify the samples we want to construct our matrices from.

In [1]:
from datetime import datetime
now = datetime.now()
today = now.strftime('%Y-%m-%d')
currentTime = now.strftime('%H:%M:%S')
print("run on {}".format( today +  " " + currentTime ))

import numpy as np
import pandas as pd
from pathlib import Path

run on 2022-04-29 16:37:08


In [2]:
# define the bucket in the native gcp project we need to copy the files to 
DESTINATION_BUCKET_ID = "anvil-tcga-edu-ucsc-kiim-lab-spark"

In [3]:
# backups of terra data models are stored in a separate repo
# so that branch merges do not lose data model versions
rootDir = "../../../terraDataModels/test-aedavids-proj/TCGA"
listOfWorkSpacePath = rootDir + "/" + "listOfWorkSpaces.csv" 
workspaceDF = pd.read_csv( listOfWorkSpacePath )
workSpaceNamesList = workspaceDF.loc[:, "wokspace"].to_list()

# Find column with results generated by salmonTarQuantWorkflow v 4
This workflow uses the salmon paired read bug fix. The column name is not always the same because there are 33 different workspaces and I had to name the output column manually for each run. For an unknown reason, Terra would not let me define these values in JSON.

In [4]:
def readDataModel( rootDir, workspaceName, entityName ) :
    '''
    entityName refers to one of the terra data model tsv files, for example 'sample'
    '''
    dataModelTSV = rootDir + "/" + workspaceName + "/" + entityName + ".tsv"
    dataModelDF = pd.read_csv(dataModelTSV, delimiter='\t')
    return dataModelDF

In [5]:
def findQuantFileColumnName(workSpaceNamesList, skipWorkspaceList) :
    '''
    loads most of the tcga sample tsv files. See the code for the workspaces that were skipped.
    they were skipped because not all the expected samples ran.
    
    returns a dictionary.
        key is the workspaceName
        value is (quantColName, df)
    '''
    dataDict = dict()


    for workspaceName in workSpaceNamesList:
        if workspaceName in skipWorkspaceList:
            continue

        #print(workspaceName)
        df = readDataModel( rootDir, workspaceName, entityName = "sample" )
        quantMatchlist = [s for s in df.columns if "quantF" in s]
        colName = [s for s in quantMatchlist if "3" in s][0]
        #print(colName)
        #print()
        dataDict[workspaceName] = (colName, df) 
        
    return dataDict
        
# we need to dig into these workspaces to figure out why the runs failed.
# check out the minimap and star wdls for a good example of how to
# work with single end, paired end, and multiple replicate fastq files.
# there is a good chance that is the source of our bugs
skipWorkspaceList = ['TCGA_DLBC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
               ,'TCGA_GBM_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
               ,'TCGA_LAML_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
               ,'TCGA_SARC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
              ]
dataDict = findQuantFileColumnName(workSpaceNamesList, skipWorkspaceList)        
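
The lookup above takes the first column whose name contains both "quantF" and "3", so a workspace with no such column fails with a bare IndexError. A minimal sketch of a stricter lookup follows. The function name `findQuantColumn` and its `prefix` and `version` parameters are hypothetical, and requiring exactly one match is an assumption, not the notebook's behavior.

```python
def findQuantColumn(columns, prefix="quantF", version="3"):
    """Return the single column name containing both prefix and version.

    Hypothetical helper: unlike the notebook's lookup, which takes the
    first match, this raises if there are zero or several candidates.
    """
    matches = [c for c in columns if prefix in c and version in c]
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one quant column, found {}: {}".format(len(matches), matches)
        )
    return matches[0]


# toy column list, not taken from a real workspace
toyColumns = ["entity:sample_id", "mRNASeq_fastq_path", "quantFile3"]
print(findQuantColumn(toyColumns))
```

Failing with the list of candidates makes it easier to see which workspace used an unexpected column name.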

## Find samples with quant files, and samples that failed and need to be re-run

In [6]:
def findSamplesWithMissingQuantFile(rootDir, workspaceName, quantFileColName):
    '''
    returns a data frame of failed samples
    '''
    sampleDF = readDataModel( rootDir, workspaceName, entityName = "sample" )     
    print("\nworkspace: {}".format(workspaceName))
    
    mRNARowsLogicalPS = sampleDF['mRNASeq_fastq_path'].notna()
    nunMRNAFiles = sum( mRNARowsLogicalPS )
    print("number of mRNASeq_fastq_path files: {}".format(nunMRNAFiles))
    mRNA_DF = sampleDF.loc[mRNARowsLogicalPS,:]

    # split the rows that have fastq files by whether they have results, i.e. a value in quantFileColName
    passedSamplesLogicalPS = mRNA_DF[quantFileColName].notna()
    numPassed = sum(passedSamplesLogicalPS) 
    
    passedSamplesDF = mRNA_DF.loc[passedSamplesLogicalPS,:]
    
    failedSampleLogicalPS =  mRNA_DF[quantFileColName].isna()
    failedSampleDF = mRNA_DF.loc[failedSampleLogicalPS,:]
    
    numFailed = nunMRNAFiles - numPassed
   
    print("num passed:{}".format(numPassed))
    print("num failed:{}".format(numFailed))
    
    return failedSampleDF

In [7]:
def findSamplesWithQuantFile(sampleDF, quantFileColName):   
    '''
    returns a data frame with all the samples that have quant files
    '''
    mRNARowsLogicalPS = sampleDF['mRNASeq_fastq_path'].notna()
    nunMRNAFiles = sum( mRNARowsLogicalPS )
    print("number of mRNASeq_fastq_path files: {}".format(nunMRNAFiles))
    mRNA_DF = sampleDF.loc[mRNARowsLogicalPS,:]

    # split the rows that have fastq files by whether they have results, i.e. a value in quantFileColName
    passedSamplesLogicalPS = mRNA_DF[quantFileColName].notna()
    numPassed = sum(passedSamplesLogicalPS) 
    
    passedSamplesDF = mRNA_DF.loc[passedSamplesLogicalPS,:]
    
    failedSampleLogicalPS =  mRNA_DF[quantFileColName].isna()
    failedSampleDF = mRNA_DF.loc[failedSampleLogicalPS,:]
    
    numFailed = nunMRNAFiles - numPassed
   
    print("num passed:{}".format(numPassed))
    print("num failed:{}".format(numFailed))
    
    return passedSamplesDF
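
The two functions above share the same masking logic: keep rows that have a fastq path, then split them by whether the quant column is populated. A minimal sketch on a toy data frame is below. The sample ids and gs:// paths are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# toy stand-in for a terra 'sample' data model; all values are invented
toyDF = pd.DataFrame({
    "entity:sample_id": ["S1", "S2", "S3", "S4"],
    "mRNASeq_fastq_path": ["gs://toy/S1.tar", "gs://toy/S2.tar", None, "gs://toy/S4.tar"],
    "quantFile3": ["gs://toy/S1.quant.sf.gz", None, None, "gs://toy/S4.quant.sf.gz"],
})

hasFastq = toyDF["mRNASeq_fastq_path"].notna()
hasQuant = toyDF["quantFile3"].notna()

# samples with input data and a result
passedDF = toyDF.loc[hasFastq & hasQuant, :]
# samples with input data but no result; these need to be re-run
failedDF = toyDF.loc[hasFastq & ~hasQuant, :]

print(passedDF["entity:sample_id"].to_list())
print(failedDF["entity:sample_id"].to_list())
```

S3 appears in neither frame because it has no fastq file, which is why the passed and failed counts sum to the number of mRNASeq_fastq_path files rather than to the number of rows.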

In [8]:
# test data set
#
# workspace: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
# number of mRNASeq_fastq_path files: 177
# num passed:105
# num failed:72
# failedSampleDF.shape:(72, 63)
# quantFile3

def testfindSamplesWithQuantFiles():
    workspaceName = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
    quantColName, sampleDF = dataDict[workspaceName]
    print("workspaceName: {} quantColName: {}".format(workspaceName, quantColName))

    quantDF = findSamplesWithQuantFile(sampleDF, quantColName)
    print("quantDF.shape:{}".format(quantDF.shape))
    numQuantFiles = quantDF.shape[0]
    assert (numQuantFiles == 105), "ERROR expected 105 quant files in " + workspaceName
    
#     print("quantDF.head():\n{}".format(quantDF.head()))
    
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)
    debugCols = ['entity:sample_id', 'participant', quantColName]
    debugDF = quantDF.loc[:, debugCols]
    print("debugDF.head():\n{}".format(debugDF.head()) )
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.max_colwidth')
    
testfindSamplesWithQuantFiles()

workspaceName: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab quantColName: quantFile3
number of mRNASeq_fastq_path files: 177
num passed:105
num failed:72
quantDF.shape:(105, 63)
debugDF.head():
   entity:sample_id   participant  \
1   READ-AF-2687-TP  READ-AF-2687   
3   READ-AF-2689-NT  READ-AF-2689   
6   READ-AF-2690-TP  READ-AF-2690   
8   READ-AF-2691-NT  READ-AF-2691   
11  READ-AF-2692-NT  READ-AF-2692   

                                                                                                                                                                                     quantFile3  
1   gs://fc-secure-8a69fc00-b6c9-4179-aee5-f1e47a4475dd/34b2bbfb-4f9a-41d4-bfd8-b55a8e1987de/quantify/ef52a514-2fc0-4e85-946a-2bbbbc56ab96/call-salmon_paired_reads/READ-AF-2687-TP.quant.sf.gz  
3   gs://fc-secure-8a69fc00-b6c9-4179-aee5-f1e47a4475dd/34b2bbfb-4f9a-41d4-bfd8-b55a8e1987de/quantify/5b5b1621-f465-482b-a098-3c34a83ebda3/call-salmon_paired_reads/READ-AF-2689-NT.quant.

## Create col data

In [9]:
def createColDataDataFrame(rootDir, workspaceName, quantDF):
    
    participantDF = readDataModel( rootDir, workspaceName, entityName = "participant")
    
    
#     some patients have multiple samples    
#     quantParticipantSeries = quantDF["participant"]
    
#     print("num Unique quantParticipantSeries.shape :{}".format(quantParticipantSeries.unique().shape) )
#     participantSeries = participantDF["entity:participant_id"]
    
#     selectRows =  participantSeries.isin( quantParticipantSeries )
#     colDataDF = participantDF.loc[selectRows, :]
    
    
    # merge implements inner join. ie sql 'select where'
    retDF = pd.merge(participantDF, quantDF, 
                     left_on = "entity:participant_id",
                     right_on = "participant" )
    
#     for c in retDF.columns:
#         print(c)
    
    # ??? not tissue id or site id ???
    retCols = ['entity:sample_id'
               ,'entity:participant_id'
               , 'tcga_sample_id'
               , 'Cohort'
               , 'Age'
               , 'Gender'
               , 'sample_type'
              ]
    
    return retDF.loc[:, retCols].sort_values(by="entity:sample_id", ascending=True)
    
def testCreateColDataDataFrame():
    workspaceName = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
    quantColName, sampleDF = dataDict[workspaceName]
    print("workspaceName: {} quantColName: {}".format(workspaceName, quantColName))

    quantDF = findSamplesWithQuantFile(sampleDF, quantColName)
    print("quantDF.shape:{}".format(quantDF.shape))    

    colDataDF = createColDataDataFrame(rootDir, workspaceName, quantDF)
    print("colDataDF.shape:{}".format(colDataDF.shape)) 
    for c in colDataDF.columns:
        print(c)

    print()
    print(colDataDF.head())
    
testCreateColDataDataFrame()

workspaceName: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab quantColName: quantFile3
number of mRNASeq_fastq_path files: 177
num passed:105
num failed:72
quantDF.shape:(105, 63)
colDataDF.shape:(105, 7)
entity:sample_id
entity:participant_id
tcga_sample_id
Cohort
Age
Gender
sample_type

  entity:sample_id entity:participant_id   tcga_sample_id Cohort  Age  Gender  \
0  READ-AF-2687-TP          READ-AF-2687  TCGA-AF-2687-01   READ   57    male   
1  READ-AF-2689-NT          READ-AF-2689  TCGA-AF-2689-11   READ   41  female   
2  READ-AF-2690-TP          READ-AF-2690  TCGA-AF-2690-01   READ   76  female   
3  READ-AF-2691-NT          READ-AF-2691  TCGA-AF-2691-11   READ   48  female   
4  READ-AF-2692-NT          READ-AF-2692  TCGA-AF-2692-11   READ   54  female   

  sample_type  
0          TP  
1          NT  
2          TP  
3          NT  
4          NT  
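
Because some patients have multiple samples, the merge in createColDataDataFrame is one participant to many samples, and the inner join silently drops any participant without a quant file. A minimal sketch with invented ids shows both effects. The `validate="one_to_many"` argument is an addition the notebook does not use; it asks pandas to raise if a participant id is duplicated on the left side.

```python
import pandas as pd

# toy participant and sample tables; ids and ages are invented
participantDF = pd.DataFrame({
    "entity:participant_id": ["P1", "P2", "P3"],
    "Age": [57, 41, 76],
})
quantDF = pd.DataFrame({
    "entity:sample_id": ["P1-TP", "P1-NT", "P3-TP"],
    "participant": ["P1", "P1", "P3"],
})

merged = pd.merge(
    participantDF,
    quantDF,
    left_on="entity:participant_id",
    right_on="participant",
    how="inner",              # the pd.merge default, written out for clarity
    validate="one_to_many",   # raises MergeError if participant ids repeat on the left
)

print(merged[["entity:sample_id", "entity:participant_id", "Age"]])
```

P1 appears twice, once per sample, and P2 is absent because it has no row in quantDF.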


## Create script to copy quant files from terra to native gcp project bucket
We can run Spark in the native project.

In [10]:
def createCopyCommand( quantFilesSeries, workspaceName, dstBucketId):
    numQuantfiles = quantFilesSeries.shape[0]
    retList = [""] * numQuantfiles
    for i in range(numQuantfiles):
        file = quantFilesSeries.iloc[i]
        retList[i] = "gsutil -m cp {} gs://{}/{}/".format(file, dstBucketId, workspaceName)
        
    return retList
    
def testCreateCopyCommand():
    workspaceName = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
    quantColName, sampleDF = dataDict[workspaceName]
    print("workspaceName: {} quantColName: {}".format(workspaceName, quantColName))

    quantDF = findSamplesWithQuantFile(sampleDF, quantColName)
    print("quantDF.shape:{}".format(quantDF.shape))
    numQuantFiles = quantDF.shape[0]
    assert (numQuantFiles == 105), "ERROR expected 105 quant files in " + workspaceName
    
    quantFiles = quantDF.loc[:,quantColName]
    shellScriptList = createCopyCommand( quantFiles, workspaceName, DESTINATION_BUCKET_ID)
    print("first copy comand")
    print(shellScriptList[0])
    print("\n second copy command")
    print(shellScriptList[1])
    
    assert len(shellScriptList) == numQuantFiles, "ERROR expected {} copy commands".format(numQuantFiles)
    
testCreateCopyCommand()

workspaceName: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab quantColName: quantFile3
number of mRNASeq_fastq_path files: 177
num passed:105
num failed:72
quantDF.shape:(105, 63)
first copy comand
gsutil -m cp gs://fc-secure-8a69fc00-b6c9-4179-aee5-f1e47a4475dd/34b2bbfb-4f9a-41d4-bfd8-b55a8e1987de/quantify/ef52a514-2fc0-4e85-946a-2bbbbc56ab96/call-salmon_paired_reads/READ-AF-2687-TP.quant.sf.gz gs://anvil-tcga-edu-ucsc-kiim-lab-spark/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/

 second copy command
gsutil -m cp gs://fc-secure-8a69fc00-b6c9-4179-aee5-f1e47a4475dd/34b2bbfb-4f9a-41d4-bfd8-b55a8e1987de/quantify/5b5b1621-f465-482b-a098-3c34a83ebda3/call-salmon_paired_reads/READ-AF-2689-NT.quant.sf.gz gs://anvil-tcga-edu-ucsc-kiim-lab-spark/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/
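
Each generated line copies a single file, so the `-m` flag has little to parallelize. One possible alternative, sketched below under assumptions, writes the source URLs to a manifest and feeds them to a single `gsutil -m cp -I` call, which reads the list of objects to copy from stdin. The function name `createBatchCopyScript`, the manifest file name, and the toy URLs and bucket are hypothetical.

```python
import shlex


def createBatchCopyScript(quantFileUrls, workspaceName, dstBucketId,
                          manifestName="quantFiles.txt"):
    """Return (manifestText, command) for one batched gsutil copy.

    Hypothetical helper: the manifest holds one source URL per line and
    the command pipes it into 'gsutil -m cp -I'.
    """
    manifestText = "\n".join(quantFileUrls) + "\n"
    dst = "gs://{}/{}/".format(dstBucketId, workspaceName)
    cmd = "cat {} | gsutil -m cp -I {}".format(shlex.quote(manifestName), shlex.quote(dst))
    return manifestText, cmd


# toy inputs, not real workspace paths
toyUrls = ["gs://toy-src/S1.quant.sf.gz", "gs://toy-src/S2.quant.sf.gz"]
manifestText, cmd = createBatchCopyScript(toyUrls, "WS_A", "toy-dst-bucket")
print(cmd)
```

Whether this is faster in practice depends on the number and size of the quant files; the per-file script is simpler to resume after a partial failure.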


# Create matrix scripts and col data for all workspaces

In [11]:
def run(rootDir, dstBucketId, workspaceName, quantColName, sampleDF):
    '''
    finds quant.sf files, creates the colData.csv file, and a script to copy the quant files
    from terra to a gcp bucket that we can read using apache spark
    '''
    
    quantDF = findSamplesWithQuantFile(sampleDF, quantColName)
    print("quantDF.shape:{}".format(quantDF.shape))    

    colDataDF = createColDataDataFrame(rootDir, workspaceName, quantDF)
    print("colDataDF.shape:{}".format(colDataDF.shape)) 

    quantFiles = quantDF.loc[:,quantColName]
    shellScriptList = createCopyCommand( quantFiles, workspaceName, dstBucketId)

    # save 
    outDirPath = Path(rootDir + "/" + workspaceName + "/" +  "generateTCGAMatrixCreationScripts.ipynb.out")
    print("create dir: {}".format(outDirPath))
    outDirPath.mkdir( parents=True, exist_ok=True )
    
    colDataFilePath = outDirPath.joinpath(workspaceName + "_colData.csv")
    colDataDF.to_csv(colDataFilePath, index=False)
    print("wrote file: {}".format(colDataFilePath))
    
    scriptpPath = outDirPath.joinpath(workspaceName + "_copyFromTerraToNativeGCP.sh")
    with open(scriptpPath, 'w') as fp:
        for cmd in shellScriptList:
            fp.write("{}\n".format(cmd))   
        
    print("wrote file: {}".format(scriptpPath))

    

def testRun():
    workspaceName = "TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
    quantColName, sampleDF = dataDict[workspaceName]
    print("workspaceName: {} quantColName: {}".format(workspaceName, quantColName))    
    run(rootDir, DESTINATION_BUCKET_ID, workspaceName, quantColName, sampleDF)
    
testRun()

workspaceName: TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab quantColName: quantFile3
number of mRNASeq_fastq_path files: 177
num passed:105
num failed:72
quantDF.shape:(105, 63)
colDataDF.shape:(105, 7)
create dir: ../../../terraDataModels/test-aedavids-proj/TCGA/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/generateTCGAMatrixCreationScripts.ipynb.out
wrote file: ../../../terraDataModels/test-aedavids-proj/TCGA/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/generateTCGAMatrixCreationScripts.ipynb.out/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_colData.csv
wrote file: ../../../terraDataModels/test-aedavids-proj/TCGA/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/generateTCGAMatrixCreationScripts.ipynb.out/TCGA_READ_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab_copyFromTerraToNativeGCP.sh
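
The colData file comes from an inner join while the copy script comes directly from the quant file column, so the two outputs could disagree if a sample's participant is missing from the participant table. A minimal sketch of a consistency check is below. The helper name `countsMatch` and the toy file contents are hypothetical; it only compares the number of colData rows with the number of gsutil lines.

```python
import csv
import tempfile
from pathlib import Path


def countsMatch(colDataPath, scriptPath):
    """Return True if colData has one row per copy command (hypothetical check)."""
    with open(colDataPath, newline="") as fp:
        numSamples = sum(1 for _ in csv.DictReader(fp))
    with open(scriptPath) as fp:
        numCmds = sum(1 for line in fp if line.startswith("gsutil "))
    return numSamples == numCmds


# toy files standing in for the *_colData.csv and *_copyFromTerraToNativeGCP.sh outputs
with tempfile.TemporaryDirectory() as tmp:
    colData = Path(tmp) / "toy_colData.csv"
    script = Path(tmp) / "toy_copyFromTerraToNativeGCP.sh"
    colData.write_text("entity:sample_id,Cohort\nS1,READ\nS2,READ\n")
    script.write_text(
        "gsutil -m cp gs://toy-src/S1.quant.sf.gz gs://toy-dst/ws/\n"
        "gsutil -m cp gs://toy-src/S2.quant.sf.gz gs://toy-dst/ws/\n"
    )
    ok = countsMatch(colData, script)

print(ok)
```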


In [12]:
aedwip

def runAll(dataDict, skipWorkspaceList, dstBucketId):
    # dataDict maps workspaceName -> (quantColName, sampleDF)
    for workspaceName, (quantColName, sampleDF) in dataDict.items():
        if workspaceName in skipWorkspaceList:
            print("\n***** skipping: {}".format(workspaceName))
            continue

        print("\n*******\nworkspaceName: {} quantColName: {}".format(workspaceName, quantColName))

        # run() expects workspaceName as its third argument
        run(rootDir, dstBucketId, workspaceName, quantColName, sampleDF)


runAll(dataDict, skipWorkspaceList, DESTINATION_BUCKET_ID)

NameError: name 'aedwip' is not defined